Deep Dive · Discipline
Evals: How Do You Know It's Working?
Most teams ship AI on vibes — it felt good in the demo, so out it goes. Then it quietly degrades and nobody notices for weeks. An eval is just an answer to one question: how do you know it's working? If you can't answer that, you don't have a product — you have a guess.
The vibes trap
You can't improve what you don't measure, and you can't trust what you can't test. "It looked right" is the most expensive sentence in AI — it's how a system that broke three weeks ago is still in production today.
On vibes: Spot-check a few outputs, feel good, ship. There is no way to tell when it regresses, so every change is a gamble.
With evals: A fixed set of cases with known-good answers. Every change gets scored, and regressions are caught before users see them.
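As a rough illustration of scoring every change against a fixed set of cases, here is a minimal Python sketch. The case IDs, the toy `old_system` and `new_system` functions, and the exact-match scoring rule are all hypothetical stand-ins; a real harness would call your actual model and likely use a richer comparison.

```python
from typing import Callable

# Hypothetical frozen cases: each input is paired with a known-good answer.
GOLDEN_SET = [
    {"id": "return-01", "input": "Can I return an opened item?", "expected": "yes"},
    {"id": "refund-01", "input": "Do you refund shipping costs?", "expected": "yes"},
    {"id": "hours-01", "input": "Are you open on Sundays?", "expected": "no"},
]


def run_golden_set(system: Callable[[str], str], cases: list[dict]) -> dict[str, bool]:
    """Score each case as pass/fail using a normalized exact match."""
    return {
        case["id"]: system(case["input"]).strip().lower() == case["expected"]
        for case in cases
    }


def accuracy(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case IDs that passed on the previous version and fail on the current one."""
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and not current.get(case_id, False)
    )


# Toy stand-ins for two versions of the system under test (assumptions).
def old_system(text: str) -> str:
    return "yes" if "return" in text.lower() else "no"


def new_system(text: str) -> str:
    return "yes" if "refund" in text.lower() else "no"


baseline_results = run_golden_set(old_system, GOLDEN_SET)
candidate_results = run_golden_set(new_system, GOLDEN_SET)
regressions = find_regressions(baseline_results, candidate_results)
```

In this toy data both versions pass two of three cases, yet `regressions` still flags `return-01`. Comparing per-case results, rather than only the aggregate score, is what surfaces a change that fixed one case while breaking another.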
The four kinds of evals
Golden set: A frozen set of inputs with the right answers. The bedrock; run it on every change to catch regressions.
Model-graded (LLM-as-judge): A second model grades the first against a rubric. Scales human judgment to thousands of cases cheaply.
Regression: Did today's change break yesterday's wins? Pin the cases so that "green" means working, not lucky.
Human review: Sampled, structured human grading on the cases that matter most. The ground truth your automated evals calibrate against.
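One way to read "calibrate against" is to measure how often an automated grader agrees with human reviewers on the same cases. The sketch below assumes both produce a simple pass/fail verdict per case; the case IDs and verdicts are invented for illustration, and real rubrics are often multi-point scales that need a more careful agreement measure.

```python
def agreement_rate(judge_grades: dict[str, bool], human_grades: dict[str, bool]) -> float:
    """Fraction of human-graded cases where the automated judge gave the same verdict."""
    shared = judge_grades.keys() & human_grades.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    matches = sum(judge_grades[case_id] == human_grades[case_id] for case_id in shared)
    return matches / len(shared)


def disagreements(judge_grades: dict[str, bool], human_grades: dict[str, bool]) -> list[str]:
    """Case IDs where judge and human differ; these are worth reading by hand."""
    shared = judge_grades.keys() & human_grades.keys()
    return sorted(c for c in shared if judge_grades[c] != human_grades[c])


# Hypothetical verdicts: humans grade a sample, the judge grades everything.
human = {"case-1": True, "case-2": False, "case-3": True, "case-4": True}
judge = {"case-1": True, "case-2": True, "case-3": True, "case-4": True, "case-5": False}

rate = agreement_rate(judge, human)
to_review = disagreements(judge, human)
```

Here the judge matches the human verdict on three of the four shared cases, and `to_review` points at the one case where the rubric or the judge prompt may need work.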
Start with a baseline — it's the first win
Before you can prove AI helped, you have to record what "before" looked like. Capture today's numbers — accuracy, time-per-task, escalation rate — then measure after. You can't prove a lift you never recorded, and the act of baselining is itself the first deliverable: it tells you exactly where AI should hit hardest.
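A before-and-after comparison can be as small as the sketch below. The `Snapshot` fields mirror the three metrics named above; the class name, field names, and every number are made-up placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    accuracy: float          # fraction of tasks completed correctly
    minutes_per_task: float  # average handling time
    escalation_rate: float   # fraction of tasks escalated to a human


def relative_change(before: float, after: float) -> float:
    """Change relative to the baseline value, e.g. -0.2 means 20% lower."""
    if before == 0:
        raise ValueError("baseline value is zero; relative change is undefined")
    return (after - before) / before


def lift_report(before: Snapshot, after: Snapshot) -> dict[str, float]:
    """Relative change for each recorded metric."""
    metrics = ("accuracy", "minutes_per_task", "escalation_rate")
    return {m: relative_change(getattr(before, m), getattr(after, m)) for m in metrics}


# Placeholder numbers for illustration only.
baseline = Snapshot(accuracy=0.80, minutes_per_task=10.0, escalation_rate=0.20)
with_ai = Snapshot(accuracy=0.88, minutes_per_task=8.0, escalation_rate=0.15)

report = lift_report(baseline, with_ai)
```

With these placeholder values the report shows accuracy up by roughly 10% and both time-per-task and escalation rate down. Without the `baseline` snapshot there would be nothing to compute the change against.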
If you can't measure it, you can't trust it.
Evals turn AI from a demo you hope works into a system you can prove works — and improve on purpose. We build the baseline and the eval harness as the first phase of every engagement.