Deep Dive 9 · Technique

Harness Engineering

Context engineering decides what the model can see. Harness engineering decides what happens around every answer — the tools it can use, the check that says "this is actually right," and the loop that runs until it is.

The model is the engine. The harness is the car. A world-class engine bolted to nothing still goes nowhere.

Where it sits

Three disciplines, stacked. Most teams stop at the first one and wonder why their agents impress in a demo and collapse in production.

💬

Prompt craft

Tune the wording of one request until it behaves.

🧱

Context engineering

Tune what the model sees when it answers. → Deep Dive 5

🏗️

Harness engineering

Tune the loop it runs inside — the tools it calls, the checks it must pass, what it's allowed to do, and how it recovers when it fails — so it doesn't just answer, it finishes.

Same model. Up to six times the output.

This isn't a soft idea. Hand the identical model to two teams and the one with the better harness can get several times more out of it: the intelligence is the same; the system around it isn't. It's why "we have access to GPT/Claude" and "we get repeatable value from it" are two very different sentences.

6×

performance swing between the best and worst harness around the same model

Stanford & Tsinghua study, 2025

potential lift to global GDP from AI — still mostly unrealized, for lack of a system layer

Goldman Sachs

11%

of AI pilots reach production; the rest have models but no harness to hold them up

Industry, 2025

The heart of a harness is the verifier

An agent without a verifier is a confident guess generator. So before you automate anything, ask one question — not "should we put a loop around this?" but "where's the crisp verifier?" The thing that says pass or fail without a human squinting at it: a test that exits green, a schema that validates, a smoke check that loads the page.

✅

"Done" is checkable

Write the check first. Then a loop is safe — it runs until the check passes, and you can trust the result. Image backfills, schema audits, render pipelines, anything with a pass/fail exit code.

🎨

"Done" is subjective

A logo, a video script, a brand voice — there's no exit code for taste. Don't wrap a loop around fuzzy judgment. Keep a human in the seat, or write a real rubric first.

The loop isn't the innovation — the verifier is. An agent is only ever as good as the check it has to pass. Get that check right and the loop becomes trustworthy; skip it and you've just automated being confidently wrong, faster.
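To make "crisp verifier" concrete, here is a minimal sketch in Python. It assumes the check is a command whose exit code carries the verdict: 0 is green, anything else is red. The function name `verify` and the two stand-in commands are illustrative, not from any particular tool; a real project would point it at its test suite, schema validator, or smoke check.

```python
import subprocess
import sys
from typing import Sequence


def verify(command: Sequence[str], timeout_s: float = 60.0) -> bool:
    """Run a check command. Pass means exit code 0; anything else is a fail."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A check that hangs or cannot start counts as red, not "probably fine".
        return False
    return result.returncode == 0


# Stand-in checks (assumptions for illustration): tiny Python one-liners
# that exit 0 or 1, in place of a real test or validation command.
passing_check = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing_check = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(verify(passing_check))  # True
print(verify(failing_check))  # False
```

The design choice worth noticing: the verdict comes from the exit code, never from the agent's own summary of how things went.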

The 6 parts of a harness

Everything that turns a clever response into finished, trusted work falls into six buckets. Production systems build all six; demos build none.

🔧

Tools

Scoped capability — read a file, run a test, query a DB, call an API. What the agent can actually do. See MCP.

✅

The verifier

One command that exits green or red. The heart. The agent records proof — it never grades its own homework.

🔁

The feedback loop

Red → fix → green, automatically, until the check passes. This is the part everyone calls "agentic." It only works because of the verifier.

📐

Definition of done

Three layers: it parses → it runs → it works in the whole system. "Code written" is not done. "Verification passed" is.

🧱

Conventions & memory

A checked-in list of features and the proof each one works, one idempotent startup command, and lessons that graduate into permanent checks.

🚦

Guardrails & escalation

What the agent must not do, and exactly when to stop and hand back to a human. Speed is only safe with brakes.
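Four of the six parts fit in one short sketch: the verifier, the feedback loop, the three-layer definition of done, and escalation. This is a toy under stated assumptions, not a production harness. Every name here (`run_harness`, `Outcome`, the layer checks) is hypothetical, the "agent" is a scripted list of drafts, and candidates are executed in-process with `exec` purely for brevity; a real harness would run them in a sandbox. Tools and conventions are left out.

```python
import ast
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Outcome:
    status: str                      # "done" or "escalated"
    attempts: int
    proof: List[str] = field(default_factory=list)  # recorded verdict per attempt


def parses(source: str) -> bool:
    """Layer 1: the candidate is syntactically valid."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def runs(source: str) -> bool:
    """Layer 2: the candidate executes without raising."""
    try:
        exec(compile(source, "<candidate>", "exec"), {})
        return True
    except Exception:
        return False


def works(source: str) -> bool:
    """Layer 3: the candidate does the job (here: defines a correct add)."""
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


LAYERS = [("parses", parses), ("runs", runs), ("works", works)]


def first_failure(source: str) -> Optional[str]:
    """Return the name of the first red layer, or None if all are green."""
    for name, check in LAYERS:
        if not check(source):
            return name
    return None


def run_harness(propose: Callable[[Optional[str]], str], max_attempts: int = 3) -> Outcome:
    """Red -> fix -> green, with a hard stop that hands back to a human."""
    feedback: Optional[str] = None
    proof: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        failed = first_failure(candidate)
        verdict = "all layers green" if failed is None else f"red at '{failed}'"
        proof.append(f"attempt {attempt}: {verdict}")
        if failed is None:
            return Outcome("done", attempt, proof)
        feedback = failed  # tell the agent which layer it broke
    return Outcome("escalated", max_attempts, proof)  # guardrail: stop and escalate


# Scripted stand-in for the agent (assumption): three successive drafts.
drafts = iter([
    "def add(a, b) return a + b",        # syntax error: red at "parses"
    "def add(a, b):\n    return a - b",  # runs, but wrong: red at "works"
    "def add(a, b):\n    return a + b",  # green
])
outcome = run_harness(lambda feedback: next(drafts))
print(outcome.status, outcome.attempts)  # done 3
```

Note that "done" is returned only by the verifier path, and the attempt cap is the brake: when it runs out, the harness reports "escalated" instead of pretending to have finished.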

Where agents die without one

Every failed AI rollout we've seen traces back to a missing piece of harness — not a model that wasn't smart enough.

🙈

Self-declared done

"Looks good!" with no proof. The agent says it shipped; nobody checked. Works until the day it doesn't.

📉

Silent rot

Great in the demo, degrades in production, and no one notices for weeks. The fix is a baseline you measure against. → Evals

🚀

The 11% problem

79% of teams have an AI pilot; only 11% reach production — because there's no harness holding it up. → Why pilots die
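For silent rot, "a baseline you measure against" can start as small as this. A hedged sketch: the function names, the 0.05 tolerance, and the sample results are all assumptions for illustration, and a real eval suite would score far more cases than this.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(results) / len(results)


def has_regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when the current pass rate falls below the baseline by more than the tolerance."""
    return current < baseline - tolerance


# Made-up numbers: 18 of 20 cases passed at launch, 15 of 20 pass now.
baseline_rate = pass_rate([True] * 18 + [False] * 2)
this_week = pass_rate([True] * 15 + [False] * 5)

print(has_regressed(baseline_rate, this_week))  # True
```

Run on a schedule against a fixed set of cases, a check like this turns "no one notices for weeks" into a red result the day quality drops.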

Why it matters for your business

A harness is the difference between a demo and a system you'd trust at 2 a.m. It's cheap — no retraining, no new model — and it's the discipline behind every agent that survives real users. The teams winning with AI aren't running it unattended on vibes; they're running disciplined, verifier-backed loops. They just don't always call them "agents." That's the bar, and it's learnable.

Context engineering tunes the input; harness engineering tunes the loop. Together they're how teams cross from impressive demos to agents that actually ship. We turned ours into six operator habits — Get your agents to actually finish →

Where this is going: harnesses that tune themselves

The frontier is agents that improve their own harness. Researchers (Microsoft and City University of Hong Kong) call it retrospective harness optimization — the agent studies its own past runs, finds where it went wrong, and proposes updates to its own rules and tools. On a hard coding benchmark, that pushed scores from 0.59 to 0.78 with no human grader in the loop.

Notice why it works: it rests on the agent self-validating (did I actually finish?) and checking self-consistency (do my attempts agree?). Same lesson as ever — an agent is only as good as the checks it runs against itself. And the moment one can edit its own harness, you need the unglamorous parts more, not less: audit logs, rollback paths, and a human approval gate, so it can't quietly reinforce a bad habit.
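The self-consistency question ("do my attempts agree?") can be illustrated with a plain majority vote. To be clear, this is not the method from the research described above; it is a generic sketch, and the function name and the 0.6 agreement threshold are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def consistent_answer(attempts: Sequence[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority answer if enough attempts agree, otherwise None."""
    if not attempts:
        return None
    answer, votes = Counter(attempts).most_common(1)[0]
    return answer if votes / len(attempts) >= min_agreement else None


print(consistent_answer(["42", "42", "41", "42", "42"]))  # 42
print(consistent_answer(["a", "b", "c"]))                 # None
```

A `None` here is a signal to stop and escalate rather than pick a winner anyway, which is where the approval gate earns its place.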

The whole picture

[Infographic: Harness Engineering. The model is the engine, the harness is the car. The 6 parts of a harness: tools, verifier, feedback loop, definition of done, conventions and memory, guardrails. Same model, up to 6× the output.]

Your agent doesn't need to be smarter. It needs a harness.

Most underperforming AI doesn't need a bigger model — it needs a verifier, a definition of done, and a loop that respects both. That's the highest-leverage engineering in AI right now, and it's where reliable systems are won.

