Deep Dive 9 · Technique
Harness Engineering
Context engineering decides what the model can see. Harness engineering decides what happens around every answer — the tools it can use, the check that says "this is actually right," and the loop that runs until it is.
The model is the engine. The harness is the car. A world-class engine bolted to nothing still goes nowhere.
Where it sits
Three disciplines, stacked. Most teams stop at the first one and wonder why their agents impress in a demo and collapse in production.
Prompt engineering: Tune the wording of one request until it behaves.
Context engineering: Tune what the model sees when it answers. → Deep Dive 5
Harness engineering: Tune the loop it runs inside (the tools it calls, the checks it must pass, what it's allowed to do, and how it recovers when it fails) so it doesn't just answer, it finishes.
Same model. Up to six times the output.
This isn't a soft idea. Hand the identical model to two teams and the one with the better harness can be multiples more productive — the intelligence is the same; the system around it isn't. It's why "we have access to GPT/Claude" and "we get repeatable value from it" are two very different sentences.
The heart of a harness is the verifier
An agent without a verifier is a confident guess generator. So before you automate anything, ask one question — not "should we put a loop around this?" but "where's the crisp verifier?" The thing that says pass or fail without a human squinting at it: a test that exits green, a schema that validates, a smoke check that loads the page.
Write the check first. Then a loop is safe: it runs until the check passes, and you can trust the result. This suits image backfills, schema audits, render pipelines, and anything else with a pass/fail exit code.
A logo, a video script, a brand voice — there's no exit code for taste. Don't wrap a loop around fuzzy judgment. Keep a human in the seat, or write a real rubric first.
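A minimal sketch of "write the check first, then loop," in Python. The names `run_verifier`, `loop_until_green`, and `attempt_fix` are hypothetical, and the attempt budget of 5 is an assumption for illustration. The verifier is any command whose exit code is the verdict, and the loop stops the moment it goes green.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_verifier(command: Sequence[str]) -> bool:
    """Return True when the check command exits with status 0 (green)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


def loop_until_green(
    verifier: Callable[[], bool],
    attempt_fix: Callable[[int], None],
    max_attempts: int = 5,
) -> bool:
    """Red -> fix -> green: keep fixing until the check passes or the budget runs out."""
    for attempt in range(1, max_attempts + 1):
        if verifier():
            return True  # Proof recorded by the check, not by the agent's opinion.
        attempt_fix(attempt)  # Hypothetical hook: the agent makes one repair attempt.
    return verifier()  # Final verdict after the last fix.


if __name__ == "__main__":
    # Stand-in check that always exits 0; swap in your real test command.
    check = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(loop_until_green(lambda: run_verifier(check), lambda n: None))
```

The loop never asks the agent whether it succeeded. It only asks the check, which is what makes running it unattended reasonable.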
The six parts of a harness
Everything that turns a clever response into finished, trusted work falls into six buckets. Production systems build all six; demos build none.
Scoped capability — read a file, run a test, query a DB, call an API. What the agent can actually do. See MCP.
One command that exits green or red. The heart. The agent records proof — it never grades its own homework.
Red → fix → green, automatically, until the check passes. This is the part everyone calls "agentic." It only works because of the verifier.
Three layers: it parses → it runs → it works in the whole system. "Code written" is not done. "Verification passed" is.
A checked-in list of features and the proof each one works, one idempotent startup command, and lessons that graduate into permanent checks.
What the agent must not do, and exactly when to stop and hand back to a human. Speed is only safe with brakes.
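Two of the parts above are easy to sketch in code: the three-layer definition of done, and the brakes. This is an illustration under stated assumptions, not a reference design. `DoneReport`, `check_done`, and `Guardrails` are hypothetical names, the forbidden actions are made-up examples, and the "it parses" layer assumes the work product is Python source.

```python
from __future__ import annotations

import ast
from dataclasses import dataclass
from typing import Callable


@dataclass
class DoneReport:
    parses: bool = False
    runs: bool = False
    works_in_system: bool = False

    @property
    def done(self) -> bool:
        # "Code written" is not done; all three layers must pass.
        return self.parses and self.runs and self.works_in_system


def check_done(
    source: str,
    run_check: Callable[[], bool],
    system_check: Callable[[], bool],
) -> DoneReport:
    """Evaluate the layers in order and stop at the first failure."""
    report = DoneReport()
    try:
        ast.parse(source)  # Layer 1: it parses.
        report.parses = True
    except SyntaxError:
        return report
    report.runs = run_check()  # Layer 2: it runs (for example, unit tests).
    if not report.runs:
        return report
    report.works_in_system = system_check()  # Layer 3: it works in the whole system.
    return report


@dataclass(frozen=True)
class Guardrails:
    # Hypothetical examples of actions the agent must never take.
    forbidden_actions: frozenset[str] = frozenset({"delete_database", "force_push"})
    max_attempts: int = 5  # Assumed budget before handing back to a human.

    def allows(self, action: str) -> bool:
        return action not in self.forbidden_actions

    def must_hand_back(self, attempts_used: int) -> bool:
        return attempts_used >= self.max_attempts
```

Checking the layers in order keeps failures cheap: there is no point running a system-level smoke test on code that doesn't parse.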
Where agents die without one
Every failed AI rollout we've seen traces back to a missing piece of harness — not a model that wasn't smart enough.
"Looks good!" with no proof. The agent says it shipped; nobody checked. Works until the day it doesn't.
Great in the demo, degrades in production, and no one notices for weeks. The fix is a baseline you measure against. → Evals
79% of teams have an AI pilot; only 11% reach production — because there's no harness holding it up. → Why pilots die
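"A baseline you measure against" can be as small as the sketch below. `EvalResult`, `has_regressed`, and the two-point tolerance are assumptions chosen for illustration; the idea is only that a recorded pass rate turns silent degradation into a red check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # An empty run counts as 0.0 rather than dividing by zero.
        return self.passed / self.total if self.total else 0.0


def has_regressed(
    baseline: EvalResult,
    current: EvalResult,
    tolerance: float = 0.02,
) -> bool:
    """True when the current pass rate falls below the baseline by more than the tolerance."""
    return current.pass_rate < baseline.pass_rate - tolerance


if __name__ == "__main__":
    baseline = EvalResult(passed=90, total=100)  # Recorded when the demo looked great.
    this_week = EvalResult(passed=85, total=100)  # Measured on the same cases in production.
    print(has_regressed(baseline, this_week))
```

Run on a schedule against the same eval cases, a check like this means someone notices in a day, not in weeks.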
Why it matters for your business
A harness is the difference between a demo and a system you'd trust at 2 a.m. It's cheap — no retraining, no new model — and it's the discipline behind every agent that survives real users. The teams winning with AI aren't running it unattended on vibes; they're running disciplined, verifier-backed loops. They just don't always call them "agents." That's the bar, and it's learnable.
Where this is going: harnesses that tune themselves
The frontier is agents that improve their own harness. Researchers (Microsoft and City University of Hong Kong) call it retrospective harness optimization — the agent studies its own past runs, finds where it went wrong, and proposes updates to its own rules and tools. On a hard coding benchmark, that pushed scores from 0.59 to 0.78 with no human grader in the loop.
Notice why it works: it rests on the agent self-validating (did I actually finish?) and checking self-consistency (do my attempts agree?). Same lesson as ever — an agent is only as good as the checks it runs against itself. And the moment one can edit its own harness, you need the unglamorous parts more, not less: audit logs, rollback paths, and a human approval gate, so it can't quietly reinforce a bad habit.
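The two ideas in that paragraph, self-consistency and an approval gate with audit log and rollback, can be sketched as follows. This is not the researchers' method. `self_consistent_answer`, `HarnessStore`, and the 60% agreement threshold are hypothetical and exist only to show the shape of the safeguards.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


def self_consistent_answer(
    attempts: list[str], min_agreement: float = 0.6
) -> Optional[str]:
    """Return the majority answer only if enough independent attempts agree."""
    if not attempts:
        return None
    answer, count = Counter(attempts).most_common(1)[0]
    return answer if count / len(attempts) >= min_agreement else None


@dataclass
class HarnessStore:
    rules: list[str]
    history: list[list[str]] = field(default_factory=list)
    audit_log: list[str] = field(default_factory=list)

    def apply_update(self, proposed_rules: list[str], approved_by: Optional[str]) -> bool:
        """Apply an agent-proposed rule change only with a named human approver."""
        if not approved_by:
            self.audit_log.append("rejected: no human approval")
            return False
        self.history.append(list(self.rules))  # Keep a rollback path.
        self.rules = list(proposed_rules)
        self.audit_log.append(f"applied: approved by {approved_by}")
        return True

    def rollback(self) -> bool:
        """Restore the previous rule set, if one exists."""
        if not self.history:
            return False
        self.rules = self.history.pop()
        self.audit_log.append("rolled back to previous rules")
        return True
```

The agent can propose as many updates as it likes. Nothing changes until a person signs off, and every decision leaves a trace.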
The whole picture
Your agent doesn't need to be smarter. It needs a harness.
Most underperforming AI doesn't need a bigger model — it needs a verifier, a definition of done, and a loop that respects both. That's the highest-leverage engineering in AI right now, and it's where reliable systems are won.