Progressical

Progressical methodology

What a harness is, how it fails, and how we engineer it.

Written for senior engineers, AI platform leads, and technical buyers who want to understand the work before they buy it.

Definition

The harness, defined precisely

Retrieval

Which documents, examples, prior turns, or records get pulled in to ground the model response.

Prompt assembly

How retrieved content, system instructions, user input, and metadata become the final prompt.

Conversation memory

What gets remembered across turns, what gets summarized, and what gets dropped as context fills.

Frozen model

LLM

Tools

Which tools the model can call, when those calls are valid, and how tool results flow back.

Output validation

Whether the response is structurally valid, semantically sensible, and safe to return.

Retries and fallbacks

What happens when retrieval, tools, validation, or the model response itself fails.

The harness is the executable layer around the model.

A harness is the executable scaffolding around a frozen LLM in a production application. It is the operational reality of the AI feature: the code that decides what the model sees, which tools it can use, how history is managed, and what happens when the output is malformed or wrong.

Most production AI features have all six layers in some form. Most have at least three of them in a state we would describe as shipped and forgotten.
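
As a rough illustration of how the layers fit around the model, here is a minimal Python sketch. It is not our production code: the `Harness` class, its field names, and the three callables are hypothetical stand-ins, and tool calling is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Harness:
    # The three callables are hypothetical stand-ins supplied by the application.
    retrieve: Callable[[str], List[str]]   # retrieval layer
    call_model: Callable[[str], str]       # the frozen LLM
    validate: Callable[[str], bool]        # output validation layer
    fallback: str = "Sorry, I could not answer that reliably."
    max_attempts: int = 2                  # retry budget
    memory_turns: int = 4                  # how many past lines stay in the prompt
    history: List[str] = field(default_factory=list)

    def assemble_prompt(self, user_input: str, docs: List[str]) -> str:
        # Prompt assembly: retrieved context, recent memory, then the user input.
        context = "\n".join(docs)
        memory = "\n".join(self.history[-self.memory_turns:])
        return f"Context:\n{context}\n\nHistory:\n{memory}\n\nUser: {user_input}"

    def run(self, user_input: str) -> str:
        docs = self.retrieve(user_input)
        prompt = self.assemble_prompt(user_input, docs)
        for _ in range(self.max_attempts):
            output = self.call_model(prompt)
            if self.validate(output):
                # Only validated exchanges are written to conversation memory.
                self.history.append(f"User: {user_input}")
                self.history.append(f"Assistant: {output}")
                return output
        # Every attempt failed validation: return the fallback instead of bad output.
        return self.fallback
```

Each decision in this sketch (what `retrieve` returns, how many turns survive in memory, what `validate` rejects) is a place where signal can be kept or lost without the model changing at all.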

Research anchor

The signal-loss finding

The Meta-Harness paper reported a result that motivates this work: an optimization loop with access to scores plus raw execution traces reached 56.7% best accuracy, while scores plus summaries reached 38.7%. The summarized version performed worse than scores alone.

The lesson is not that every system needs more tokens. It is that compression can remove the exact details the model or engineer needed. Production harnesses lose signal the same way: over-broad chunking, aggressive memory summaries, prompt templates that bury critical facts, and validation layers that discard nuance.

Audit

How a two-week audit runs

Days 1-2

Scope and instrumentation

We agree on the feature, metric, and user segment. Then we get visibility into traces, prompts, retrieved context, tool calls, outputs, and the relevant code paths. If traces do not exist yet, instrumentation comes first.

Days 3-8

Trace analysis

We sample production interactions and follow each one through the harness, looking for where context is added, dropped, summarized, transformed, or validated away.
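
One way to picture following an interaction through the harness is to check where a known fact stops appearing. The sketch below assumes a trace is an ordered list of `(layer, text)` pairs; that shape, the `first_drop` helper, and the plain substring match are illustrative assumptions, not a description of any particular tracing tool.

```python
from typing import List, Optional, Tuple


def first_drop(trace: List[Tuple[str, str]], fact: str) -> Optional[str]:
    """Return the first layer where `fact` is missing after it had appeared.

    Returns None if the fact never appears or is never dropped.
    """
    needle = fact.lower()
    seen = False
    for layer, text in trace:
        present = needle in text.lower()
        if seen and not present:
            return layer
        seen = seen or present
    return None
```

For a trace where retrieval returns "Order 4471: refund denied on 2 March." and the memory summary reads "User asked about a refund.", `first_drop(trace, "refund denied")` points at the memory summary as the layer that lost the detail. Real analysis needs fuzzier matching than a substring test, but the question asked of each trace is the same.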

Days 9-11

Failure-mode taxonomy

We classify failures by harness layer and mechanism, rank them by estimated cost to the target metric, and identify the few issues most likely to explain the gap.
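
The ranking step can be sketched as a simple aggregation. The record fields below (`layer`, `mechanism`, `est_cost`) are hypothetical names chosen for the example, and the cost unit is whatever the team has agreed to estimate against the target metric.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_failure_modes(failures: List[Dict]) -> List[Tuple[str, str, int, float]]:
    """Group failures by (layer, mechanism) and rank groups by total estimated cost."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    costs: Dict[Tuple[str, str], float] = defaultdict(float)
    for f in failures:
        key = (f["layer"], f["mechanism"])
        counts[key] += 1
        costs[key] += f["est_cost"]
    ranked = sorted(costs, key=lambda k: costs[k], reverse=True)
    return [(layer, mech, counts[(layer, mech)], costs[(layer, mech)])
            for layer, mech in ranked]
```

Ranking by summed cost rather than raw count keeps a rare but expensive failure mode from being buried under frequent, cheap ones.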

Days 12-14

Report and roadmap

You receive a written audit with examples, recommended remediations, rough effort estimates, and a prioritized roadmap. If we do not find at least three meaningful issues to fix, the audit is free.

Rebuild

How a rebuild runs

The rebuild starts where the audit ends. We convert a curated subset of failure cases plus successful cases into a 200-500 item evaluation set, then rebuild the harness layers with the highest failure-mode cost.
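
A minimal sketch of assembling such a set, under loud assumptions: seeded random sampling stands in for human curation, and the `failure_share` split is a hypothetical parameter, not a ratio we prescribe.

```python
import random
from typing import List


def build_eval_set(failures: List[dict], successes: List[dict],
                   size: int = 300, failure_share: float = 0.6,
                   seed: int = 0) -> List[dict]:
    """Mix failure and success cases into one shuffled, reproducible eval set."""
    if not 200 <= size <= 500:
        raise ValueError("size is expected to fall in the 200-500 range")
    rng = random.Random(seed)  # fixed seed so the same set can be rebuilt later
    n_fail = min(len(failures), round(size * failure_share))
    n_ok = min(len(successes), size - n_fail)
    items = rng.sample(failures, n_fail) + rng.sample(successes, n_ok)
    rng.shuffle(items)
    return items
```

Including successful cases matters as much as including failures: without them, a rebuild that fixes the known failures while breaking what already worked would still look like an improvement.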

That usually means sharper retrieval, clearer prompt assembly, memory that retains salient raw detail, structured output enforcement, and graceful fallbacks. We deliver the changes as a PR or equivalent artifact with an A/B plan your team can run.
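
Structured output enforcement with a graceful fallback might look roughly like the sketch below. The required keys (`answer`, `sources`), the `call_model` callable, and the fallback payload are assumptions made for the example; a real schema comes from the feature.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: key name -> required Python type.
REQUIRED = {"answer": str, "sources": list}


def parse_structured(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with the required keys and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in REQUIRED.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data


def answer_with_fallback(call_model: Callable[[str], str], prompt: str,
                         max_attempts: int = 2) -> dict:
    """Retry until the output parses, then fall back to an explicit, marked response."""
    for _ in range(max_attempts):
        parsed = parse_structured(call_model(prompt))
        if parsed is not None:
            return parsed
    return {"answer": "I could not produce a reliable answer.",
            "sources": [], "fallback": True}
```

Marking the fallback explicitly, here with a `fallback` flag, lets downstream code and dashboards count how often the harness gave up instead of hiding those cases among ordinary answers.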

If the rebuilt harness does not beat the existing one on the agreed metric, the remediation portion is not billed.

Operations

How ongoing operations work

Once the harness has been rebuilt and the eval set exists, ongoing operations becomes concrete: scheduled eval runs, regression alerts when prompts or models change, monitoring for latency and cost shifts, and quarterly tune-ups as user behavior drifts.
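
A regression alert can be as simple as comparing a scheduled eval run against a stored baseline. In this sketch the metric names and the `max_drop` threshold are hypothetical, and every metric is assumed to be higher-is-better (pass rates, for example); latency and cost would need the comparison reversed.

```python
from typing import Dict, List


def regression_alerts(baseline: Dict[str, float], current: Dict[str, float],
                      max_drop: float = 0.02) -> List[str]:
    """List the metrics that fell by more than `max_drop` or went missing."""
    alerts = []
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is None:
            alerts.append(f"{metric}: missing from current run")
        elif base - cur > max_drop:
            alerts.append(f"{metric}: {base:.3f} -> {cur:.3f}")
    return alerts
```

Running this check after every prompt or model change is what turns the eval set from a one-off deliverable into ongoing monitoring.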

Operations starts at $5,000/month. Some teams keep us involved; others take the eval set and run the same monitoring themselves.

Point of view

What we believe

The harness is the artifact, not the model.

Teams often try to fix harness problems by swapping to a stronger model. Sometimes that helps, but usually at higher cost. A better harness on a cheaper model often beats a worse harness on a more expensive one.

Compression is not the same as engineering.

Naive summarization can destroy the signal the model needed. The better move is usually raw access plus better navigation: retrieval that actually retrieves, and memory that distinguishes salient facts from incidental detail.
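
To make the memory point concrete, here is a small sketch of compaction that keeps salient turns verbatim instead of summarizing everything. The `is_salient` predicate and the character budget are assumptions for illustration; deciding what counts as salient is the actual engineering work.

```python
from typing import Callable, List


def compact_history(turns: List[str], budget_chars: int,
                    is_salient: Callable[[str], bool]) -> List[str]:
    """Keep every salient turn verbatim, then fill the budget with recent turns."""
    # Salient turns are always kept, even if they alone exceed the budget.
    keep = {i for i, turn in enumerate(turns) if is_salient(turn)}
    used = sum(len(turns[i]) for i in keep)
    # Walk backwards so the most recent incidental turns are kept first.
    for i in range(len(turns) - 1, -1, -1):
        if i in keep:
            continue
        if used + len(turns[i]) > budget_chars:
            break
        keep.add(i)
        used += len(turns[i])
    # Return the surviving turns in their original order.
    return [turns[i] for i in sorted(keep)]
```

With a predicate such as "contains a digit", an early turn like "My order number is 4471." survives compaction verbatim while older small talk is dropped first.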

Eval sets are the real moat.

A team with 500 graded cases from real users can move faster and more safely than a team guessing from anecdotes. Most Progressical customers leave with their first useful eval set.

Methodology v1 · Q2 2026

If you have read this far, the next conversation is probably worth twenty minutes.

Start with an audit