Deep Dive 10 · Strategy

Which AI Should We Use?

Claude or Codex? GPT or Gemini? One week a benchmark crowns a new king; the next week it flips. Here's the honest answer most vendors won't give you: for the work your company actually does, the model you pick matters far less than you think.

The labs will keep trading the lead, and the handful of things that truly decide whether AI works for you aren't on any leaderboard. Let's look at the real numbers — and the questions worth asking instead.

The question that never resolves

"Let's wait until there's a clear winner before we commit." It's the most reasonable-sounding plan in AI, and it's a trap. Two of the best-funded labs on earth are in a footrace — and footraces don't "level out." Whoever leads this month gets passed next month, on purpose, by design. If your strategy depends on the dust settling, your strategy is to wait forever.

The good news: you don't have to wait, because the thing everyone's racing over has quietly stopped being the thing that matters.

The models have already converged

The frontier used to be a cliff. Now it's a crowd. On the public benchmarks everyone quotes, the top models are bunched so tightly the differences are almost noise:

~3 pts
the entire spread across the top ten models on MMLU (87.2–90.1), a rounding error apart
Frontier model trackers, Jan 2026

>90%
where most frontier models already score on standard coding benchmarks, clustered at the ceiling
HumanEval, 2025

2027
the year analysts expect frontier models to be priced and treated like commodities, not moats
Industry analysis, 2025

The gaps that make headlines live at the fringe — exotic multi-step reasoning, training-time tricks, inference cost at enormous scale. That's real, hard research. But almost none of it touches the work a normal business needs done. Summarize this. Draft that. Write this function. Answer that ticket. Pull the numbers from this PDF. Every serious model clears that bar today — and has for a while.

Benchmarks are a sport played at the frontier. Your business almost certainly isn't competing at the frontier. You need a model that works — not the one that won this week's leaderboard by half a point.

What actually decides whether AI works for you

If the model were the deciding factor, the companies buying the best models would be winning. They're not. One of the most widely cited studies of enterprise AI found the opposite:

95%
of enterprise generative-AI pilots returned no measurable impact on the bottom line
MIT NANDA, "State of AI in Business," 2025

$30–40B
spent getting to that 95% failure rate, so the money wasn't the problem
MIT NANDA, 2025

~67%
success rate when teams buy from specialist vendors and integrate well, compared with building from scratch
MIT NANDA, 2025

None of the 5% that worked got there by picking a smarter model. They got there by how they wrapped the model into real work — the integration, the workflow, the checks around it. The failure mode is almost never "we chose the wrong AI." It's "we never built the system around it." (More on that system: Harness Engineering →)

So how should you actually choose?

Not from a podcast, and not from a benchmark tweet. Five questions that matter far more than this week's rankings:

🧪
Does it work on your tasks?

Point two models at your real codebase or tickets for a week. Measure what finished, how much rework each needed, where each face-planted. Your work is the only benchmark that counts.
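One way to keep that week-long trial honest is to log every task in the same shape and let a small script do the tallying. The sketch below is only an illustration: the `TrialResult` fields, the model labels, and the sample rows are all made up, and you would swap in whatever you actually track.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TrialResult:
    """One real task attempted by one model (all field names are illustrative)."""
    model: str            # label for the model under test, e.g. "model_a"
    task_id: str          # your own ticket or PR identifier
    finished: bool        # did the output ship without being abandoned?
    rework_minutes: int   # human time spent fixing the output


def summarize(results):
    """Group results by model; report completion rate, average rework, failures."""
    by_model = defaultdict(list)
    for r in results:
        by_model[r.model].append(r)

    summary = {}
    for model, rows in by_model.items():
        finished = [r for r in rows if r.finished]
        summary[model] = {
            "tasks": len(rows),
            "completion_rate": len(finished) / len(rows),
            # Average rework only over the tasks that actually finished.
            "avg_rework_minutes": (
                sum(r.rework_minutes for r in finished) / len(finished)
                if finished else None
            ),
            "failed_tasks": [r.task_id for r in rows if not r.finished],
        }
    return summary


# Made-up sample data, only to show the shape of the log.
sample = [
    TrialResult("model_a", "TICKET-1", True, 10),
    TrialResult("model_a", "TICKET-2", False, 0),
    TrialResult("model_b", "TICKET-1", True, 30),
    TrialResult("model_b", "TICKET-2", True, 20),
]

report = summarize(sample)
for model, stats in report.items():
    print(model, stats)
```

The list of failed task IDs is the part worth reading by hand: it shows where each model face-planted, which a single score hides.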

🔒
Governance before capability

Where do your prompts, code, and data go? How long are they retained? Is there a BAA (the business associate agreement HIPAA requires before a vendor touches patient data)? For healthcare and finance, this should gate the decision harder than any score.

🤝
Ergonomics & fit

How well a model fits your stack and your team's day-to-day matters longer than any benchmark delta. The model people actually enjoy using is the one that gets used.

🔁
Don't lock in

Keep the model a swappable part. Running two side by side costs little compared with an engineer's time, and it is trivial next to the cost of betting the company on the wrong one.
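In code, "swappable" usually means the application talks to one small interface and the vendor name lives in config. Here is a minimal sketch of that idea. `TextModel`, `EchoModel`, `REGISTRY`, and the provider names are all hypothetical, and `EchoModel` stands in for a real adapter that would wrap a vendor SDK.

```python
from typing import Callable, Dict, Protocol


class TextModel(Protocol):
    """The only surface the rest of the app is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in provider for this sketch; a real adapter would call a vendor SDK."""
    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Hypothetical registry: one entry per provider adapter you maintain.
REGISTRY: Dict[str, Callable[[], TextModel]] = {
    "provider_a": lambda: EchoModel("provider_a"),
    "provider_b": lambda: EchoModel("provider_b"),
}


def get_model(name: str) -> TextModel:
    """Look up a model by its config name, so a swap is a config change."""
    try:
        return REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown model {name!r}; known: {sorted(REGISTRY)}") from None


def summarize_ticket(model: TextModel, ticket_text: str) -> str:
    """Application code sees only the TextModel interface, never a vendor name."""
    return model.complete(f"Summarize: {ticket_text}")


print(summarize_ticket(get_model("provider_a"), "login page times out"))
print(summarize_ticket(get_model("provider_b"), "login page times out"))
```

Because `summarize_ticket` never mentions a provider, running the same task through two models is a one-line change, which is what makes a side-by-side trial cheap.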

💵
Cost at your volume

Price the actual workload you'll run, not the headline token rate. The "cheaper" model can lose once you count retries and the work it can't finish.
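A rough way to price the workload is cost per finished task rather than cost per token. The sketch below assumes a deliberately simple model: every task takes some average number of attempts, and only a fraction of tasks ever finish. The function name, the token counts, and the prices are invented for illustration and are not any vendor's real rates.

```python
def cost_per_finished_task(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    avg_attempts: float,
    completion_rate: float,
) -> float:
    """Dollars spent per task that actually finishes.

    Prices are per million tokens. avg_attempts counts retries; completion_rate
    is the share of tasks that finish at all (0 < rate <= 1).
    """
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    per_attempt = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    # Failed tasks still cost money, so spread total spend over finished ones.
    return per_attempt * avg_attempts / completion_rate


# Made-up numbers: same task size, different prices and reliability.
cheap = cost_per_finished_task(10_000, 2_000, 1.0, 4.0,
                               avg_attempts=3.0, completion_rate=0.6)
pricey = cost_per_finished_task(10_000, 2_000, 3.0, 12.0,
                                avg_attempts=1.2, completion_rate=0.9)

print(f"cheap model:  ${cheap:.3f} per finished task")
print(f"pricey model: ${pricey:.3f} per finished task")
```

With these invented inputs, the model with the tripled token price comes out cheaper per finished task (about $0.072 versus $0.09), because it retries less and finishes more. Plug in the attempt counts and completion rates from your own trial, and add human rework time if you track it.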

Where we land

We keep this page neutral on purpose — the framework above matters more than anyone's favorite. But people always ask where we land, so here it is, honestly: for most teams today, we lean Claude. Two reasons, and only one of them is ours:

📊
It's where the market already voted.

In the enterprise, Anthropic now leads LLM API usage at 40% (OpenAI 27%, Google 21%). And in coding specifically — the use case that started this whole debate — the gap is wider: roughly 54% Anthropic vs. 21% OpenAI. That's not our opinion; it's where buyers with budgets ended up. (Menlo Ventures, State of Generative AI in the Enterprise, 2025)

🛠️
It matches our hands-on experience.

Across the real systems we build and ship, Claude has been the steadiest default — especially as a coding agent. Your mileage may vary, which is exactly why question #1 exists.

But hear the whole point of the page: that's a lean, not a religion. Test Codex on your tasks next quarter, and if it wins, switch. A well-built setup lets you switch the model underneath in an afternoon. The recommendation is "start here," not "marry this."

The thing that doesn't change

Here's the reframe to send your team: the model is the engine; the value is the car you build around it. Pick a capable model — almost any frontier one will do — then put your real effort into the system: the context it sees, the checks it has to pass, the guardrails, the workflow. That's the part that compounds, and it survives every model swap and every leaderboard flip. Chase that, not the rankings.
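To make "the checks it has to pass" concrete, here is a small sketch of a wrapper that runs every check on a model's output and retries with feedback when one fails. It is one possible shape, not a prescribed design: `run_with_checks`, `max_length`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in for a real model call.

```python
from typing import Callable, List, Optional

Check = Callable[[str], Optional[str]]  # returns None on pass, else an error message


def run_with_checks(
    generate: Callable[[str], str],
    checks: List[Check],
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Call the model, run every check, and retry with feedback on failure."""
    feedback = ""
    problems: List[str] = []
    for _ in range(max_attempts):
        output = generate(prompt + feedback)
        problems = []
        for check in checks:
            message = check(output)
            if message is not None:
                problems.append(message)
        if not problems:
            return output
        # Tell the model what failed so the next attempt can correct it.
        feedback = "\nFix these problems: " + "; ".join(problems)
    raise RuntimeError(f"no passing output after {max_attempts} attempts: {problems}")


def max_length(limit: int) -> Check:
    """Build a check that rejects outputs longer than `limit` characters."""
    def check(output: str) -> Optional[str]:
        if len(output) <= limit:
            return None
        return f"output longer than {limit} characters"
    return check


def fake_model(prompt: str) -> str:
    # Stand-in for a real model: it only behaves once it sees feedback.
    return "short answer" if "Fix these problems" in prompt else "x" * 500


result = run_with_checks(fake_model, [max_length(100)], "Summarize the ticket.")
print(result)
```

Nothing here depends on which model sits behind `generate`, so the checks carry over unchanged when the model is swapped.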

Stop shopping for the winner. Build so it doesn't matter.

For the work most companies do, the frontier models are interchangeable enough that the choice isn't your bottleneck — the system around them is. Pick one that works, keep it swappable, and invest where the leverage actually lives.
