Progressical methodology
What a harness is, how it fails, and how we engineer it.
Written for senior engineers, AI platform leads, and technical buyers who want to understand the work before they buy it.
Definition
The harness, defined precisely
Retrieval
Which documents, examples, prior turns, or records get pulled in to ground the model response.
Prompt assembly
How retrieved content, system instructions, user input, and metadata become the final prompt.
Conversation memory
What gets remembered across turns, what gets summarized, and what gets dropped as context fills.
Frozen model
LLM
Tools
Which tools the model can call, when those calls are valid, and how tool results flow back.
Output validation
Whether the response is structurally valid, semantically sensible, and safe to return.
Retries and fallbacks
What happens when retrieval, tools, validation, or the model response itself fails.
The harness is the executable layer around the model.
A harness is the executable scaffolding around a frozen LLM in a production application. It is the operational reality of the AI feature: the code that decides what the model sees, which tools it can use, how history is managed, and what happens when the output is malformed or wrong.
Most production AI features have all six layers in some form. Most have at least three of them in a state we would describe as shipped and forgotten.
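To make the layers concrete, here is a minimal sketch of one harness pass, not a description of any particular production system. It assumes a hypothetical `model` callable that takes a prompt string and returns text, keyword-overlap retrieval, a four-turn memory window, and a JSON output contract with an `answer` key. All of those are illustrative choices, and the tools layer is left out for brevity.

```python
import json
from typing import Callable

# Hypothetical fallback text; a real harness would pick its own.
FALLBACK_ANSWER = "Sorry, I could not produce a reliable answer."


def run_harness(user_input: str,
                history: list[str],
                documents: list[str],
                model: Callable[[str], str],
                max_attempts: int = 2) -> dict:
    """One pass through a toy harness around a frozen model."""
    # Retrieval: keep documents that share at least one word with the input.
    words = set(user_input.lower().split())
    retrieved = [d for d in documents if words & set(d.lower().split())]

    # Conversation memory: keep the last four turns verbatim (assumed window).
    memory = history[-4:]

    # Prompt assembly: system instruction, context, history, then the user turn.
    prompt = "\n".join([
        "SYSTEM: Answer in JSON with an 'answer' key.",
        *("CONTEXT: " + d for d in retrieved),
        *("HISTORY: " + h for h in memory),
        "USER: " + user_input,
    ])

    # Output validation with retries: accept only a JSON object
    # whose 'answer' value is a string.
    for attempt in range(1, max_attempts + 1):
        raw = model(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and isinstance(parsed.get("answer"), str):
            return {"answer": parsed["answer"], "attempts": attempt, "fallback": False}

    # Fallback: every attempt failed validation.
    return {"answer": FALLBACK_ANSWER, "attempts": max_attempts, "fallback": True}
```

Each commented step is a place where the harness decides what the model sees or what the user gets back, and therefore a place where signal can be lost without anyone noticing.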
Research anchor
The signal-loss finding
The Meta-Harness paper reported a result that motivates this work: an optimization loop with access to scores plus raw execution traces reached 56.7% best accuracy, while scores plus summaries reached 38.7%. The summarized version performed worse than scores alone.
The lesson is not that every system needs more tokens. It is that compression can remove the exact details the model or engineer needed. Production harnesses lose signal the same way: over-broad chunking, aggressive memory summaries, prompt templates that bury critical facts, and validation layers that discard nuance.
Audit
How a two-week audit runs
Days 1-2
Scope and instrumentation
We agree on the feature, metric, and user segment. Then we get visibility into traces, prompts, retrieved context, tool calls, outputs, and the relevant code paths. If traces do not exist yet, instrumentation comes first.
Days 3-8
Trace analysis
We sample production interactions and follow each one through the harness, looking for where context is added, dropped, summarized, transformed, or validated away.
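One way to picture this step is a small check that follows a known fact through the stages of a single trace and reports where it disappears. The sketch below assumes a trace is an ordered list of `(stage_name, text)` snapshots and that a fact can be matched as a case-insensitive substring. Both are simplifying assumptions, and the stage names are hypothetical.

```python
from typing import Optional


def first_loss(trace: list[tuple[str, str]],
               facts: list[str]) -> dict[str, Optional[str]]:
    """For each fact, return the first stage where it was present earlier
    in the trace but is missing now, or None if it was never dropped."""
    result: dict[str, Optional[str]] = {}
    for fact in facts:
        seen = False
        lost_at = None
        for stage, text in trace:
            if fact.lower() in text.lower():
                seen = True
            elif seen:
                # The fact existed upstream and is gone at this stage.
                lost_at = stage
                break
        result[fact] = lost_at
    return result


# Hypothetical trace: an order number survives retrieval
# but not the memory summary.
trace = [
    ("retrieval", "Order 1182 shipped on 3 May; refund window is 30 days"),
    ("memory_summary", "User asked about a refund for an order"),
    ("prompt", "User asked about a refund for an order"),
]
losses = first_loss(trace, ["1182", "refund"])
```

Here `losses` maps `"1182"` to `"memory_summary"` and `"refund"` to `None`: the summary kept the topic but dropped the identifier the model would need to act on it.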
Days 9-11
Failure-mode taxonomy
We classify failures by harness layer and mechanism, rank them by estimated cost to the target metric, and identify the few issues most likely to explain the gap.
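The ranking step can be sketched as a simple aggregation. This assumes each analysed failure has already been labelled with a `layer`, a `mechanism`, and an estimated `cost` to the target metric. The field names and the cost unit are hypothetical, and producing the labels is the real work.

```python
from collections import defaultdict


def rank_failure_modes(failures: list[dict], top_n: int = 3) -> list[dict]:
    """Group labelled failures by (layer, mechanism) and rank the groups
    by total estimated cost, highest first."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    counts: dict[tuple[str, str], int] = defaultdict(int)
    for f in failures:
        key = (f["layer"], f["mechanism"])
        totals[key] += f["cost"]
        counts[key] += 1
    ranked = sorted(totals, key=lambda k: totals[k], reverse=True)
    return [
        {"layer": layer, "mechanism": mech,
         "count": counts[(layer, mech)], "cost": totals[(layer, mech)]}
        for layer, mech in ranked[:top_n]
    ]
```

Ranking by summed cost rather than by raw count keeps a rare but expensive failure mode from being buried under frequent, cheap ones.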
Days 12-14
Report and roadmap
You receive a written audit with examples, recommended remediations, rough effort estimates, and a prioritized roadmap. If we do not find at least three meaningful issues to fix, the audit is free.
Rebuild
How a rebuild runs
The rebuild starts where the audit ends. We convert a curated subset of failure cases plus successful cases into a 200-500 item evaluation set, then rebuild the harness layers with the highest failure-mode cost.
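As a rough sketch of the first pass at such a set, the function below draws a reproducible mix of failure and success cases. The 60/40 split, the fixed seed, and random sampling itself are assumptions for illustration. Curation by a person who has read the traces would follow, and sampling does not replace it.

```python
import random


def build_eval_set(failures: list[dict],
                   successes: list[dict],
                   size: int = 300,
                   failure_share: float = 0.6,
                   seed: int = 0) -> list[dict]:
    """Draw a reproducible candidate eval set of failure and success cases."""
    if not 200 <= size <= 500:
        raise ValueError("size should fall in the 200-500 item range")
    rng = random.Random(seed)  # fixed seed so the same set can be rebuilt
    n_fail = min(len(failures), round(size * failure_share))
    n_ok = min(len(successes), size - n_fail)
    items = rng.sample(failures, n_fail) + rng.sample(successes, n_ok)
    rng.shuffle(items)  # avoid grouping all failures at the front
    return items
```

Keeping successful cases in the set matters as much as the failures: they are what catches a rebuild that fixes one layer by breaking behaviour that already worked.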
That usually means sharper retrieval, clearer prompt assembly, memory that retains salient raw detail, structured output enforcement, and graceful fallbacks. We deliver the changes as a PR or equivalent artifact with an A/B plan your team can run.
If the rebuilt harness does not beat the existing one on the agreed metric, the remediation portion is not billed.
Operations
How ongoing operations works
Once the harness has been rebuilt and the eval set exists, ongoing operations becomes concrete: scheduled eval runs, regression alerts when prompts or models change, monitoring for latency and cost shifts, and quarterly tune-ups as user behavior drifts.
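A regression alert of this kind can be sketched as a comparison between two eval runs. The example assumes each run is a mapping from case id to pass/fail and that an alert fires when the pass rate drops by more than a tolerance. The two-point tolerance is an arbitrary placeholder, not a recommended threshold.

```python
def check_regression(baseline: dict[str, bool],
                     current: dict[str, bool],
                     tolerance: float = 0.02) -> dict:
    """Compare two eval runs over the same cases and flag a pass-rate drop."""
    if not baseline or baseline.keys() != current.keys():
        raise ValueError("both runs must cover the same non-empty set of cases")
    base_rate = sum(baseline.values()) / len(baseline)
    cur_rate = sum(current.values()) / len(current)
    # Cases that passed before and fail now are the ones worth reading first.
    newly_failing = sorted(c for c in baseline if baseline[c] and not current[c])
    return {
        "baseline_rate": base_rate,
        "current_rate": cur_rate,
        "newly_failing": newly_failing,
        "alert": base_rate - cur_rate > tolerance,
    }
```

Running the same check after every prompt or model change turns "did anything get worse?" into a list of specific cases to inspect.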
Operations starts at $5,000/month. Some teams keep us involved; others take the eval set and run the same monitoring themselves.
Point of view
What we believe
The harness is the artifact, not the model.
Teams often try to fix harness problems by swapping to a stronger model. Sometimes that helps, but usually at higher cost. A better harness on a cheaper model often beats a worse harness on a more expensive one.
Compression is not the same as engineering.
Naive summarization can destroy the signal the model needed. The better move is usually raw access plus better navigation: retrieval that actually retrieves, and memory that distinguishes salient facts from incidental detail.
Eval sets are the real moat.
A team with 500 graded cases from real users can move faster and more safely than a team guessing from anecdotes. Most Progressical customers leave with their first useful eval set.
Methodology v1 · Q2 2026