The business case for harness engineering
A better harness on a cheaper model often beats a worse harness on a more expensive one.
The Meta-Harness paper found 6× performance variance from harness changes alone — on the same model, same benchmark. That gap translates directly into cost and quality. Here's how.
Meta-Harness paper · Stanford / MIT · 2026
An optimization loop with access to scores plus raw execution traces reached 56.7% best accuracy. The version with scores plus summaries reached 38.7% — worse than scores alone.
Same model. Same benchmark. Different harness. The performance gap from harness changes alone was larger than the gap between generations of frontier models.
How harness engineering pays
Harness optimization works through two levers.
Lever 1 · Cost reduction
Cut LLM spend without cutting quality.
LLM API spend is primarily a function of context window usage. Most production harnesses include 20–40% of context that adds no quality signal — over-broad retrieval chunks, redundant conversation history, verbose system prompts assembled under deadline pressure.
A tuned harness reduces average token cost by 15–35% without measurable quality loss.
Example: a team spending $12,000/month often has $2,000–$4,000 of recoverable spend from over-broad retrieval and compression policies never revisited after launch.
Lever 2 · Quality improvement
Improve retention without changing the model.
The Meta-Harness paper measured what happens when you change only the harness — same model, same benchmark. The difference between the best and worst harness configuration was 6×.
That variance translates to product outcomes. A 2–5 percentage point improvement in 7-day retention is typical after a Rebuild on a consumer AI feature.
Sample result
7-day retention · consumer mental health app · pilot in progress
Estimate your numbers
Estimated outcomes
Conservative token savings: 15% reduction in context window usage
Typical token savings: 30% reduction in context window usage
Rebuild payback period: $35,000 ÷ savings midpoint
Conservative retention lift: +100 retained users/mo
Typical retention lift: +200 retained users/mo
Token savings: spend × 0.15 (conservative) to × 0.30 (typical). Payback: $35,000 ÷ monthly savings midpoint. Retention lift estimated from Progressical pilot data. Actual results depend on your harness, metric, and user segment.
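The token-savings and payback arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's actual implementation: the function and field names are hypothetical, and it assumes "savings midpoint" means the average of the conservative and typical monthly savings.

```python
from dataclasses import dataclass

# Rates and price taken from the formula text above.
CONSERVATIVE_RATE = 0.15   # conservative share of spend recovered
TYPICAL_RATE = 0.30        # typical share of spend recovered
REBUILD_PRICE = 35_000     # Rebuild engagement price in dollars


@dataclass
class SavingsEstimate:
    conservative_monthly: float
    typical_monthly: float
    midpoint_monthly: float
    payback_months: float


def estimate_savings(monthly_spend: float) -> SavingsEstimate:
    """Estimate monthly token savings and Rebuild payback period.

    Assumption: the savings midpoint is the mean of the conservative
    and typical monthly savings.
    """
    if monthly_spend <= 0:
        raise ValueError("monthly_spend must be positive")
    conservative = monthly_spend * CONSERVATIVE_RATE
    typical = monthly_spend * TYPICAL_RATE
    midpoint = (conservative + typical) / 2
    return SavingsEstimate(
        conservative_monthly=conservative,
        typical_monthly=typical,
        midpoint_monthly=midpoint,
        payback_months=REBUILD_PRICE / midpoint,
    )


if __name__ == "__main__":
    # The $12,000/month team from the cost-reduction example.
    estimate = estimate_savings(12_000)
    print(f"Savings: ${estimate.conservative_monthly:,.0f} to "
          f"${estimate.typical_monthly:,.0f} per month")
    print(f"Payback: {estimate.payback_months:.1f} months")
```

Under these assumptions, a team spending $12,000/month would save roughly $1,800 to $3,600 per month, and the Rebuild would pay back in about 13 months on token savings alone, before counting any retention lift.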
What the engagement price buys
What's included.
Audit
$15,000 · Two weeks
Find out exactly where the money is going and why quality is lower than it should be. The prioritized fix list makes every subsequent decision cheaper.
Rebuild
$35,000 · Four weeks
Fix the highest-cost signal leaks. Includes the eval set that proves the fix worked and catches the next regression automatically.
Includes eval set
Operations
$5,000/month · Ongoing
Keep the improvements from drifting as models, traffic, and user behavior change.
How every engagement starts
Run a real audit, not a calculator.
The calculator gives you a rough order of magnitude. The audit gives you the actual numbers — grounded in your harness, your traces, your metric. If we don't find three things to fix, you don't pay.
Start with an audit