Deep Dive 10 · Strategy
Which AI Should We Use?
Claude or Codex? GPT or Gemini? One week a benchmark crowns a new king; the next week it flips. Here's the honest answer most vendors won't give you: for the work your company actually does, the model you pick matters far less than you think.
The labs will keep trading the lead, and the handful of things that truly decide whether AI works for you aren't on any leaderboard. Let's look at the real numbers — and the questions worth asking instead.
The question that never resolves
"Let's wait until there's a clear winner before we commit." It's the most reasonable-sounding plan in AI, and it's a trap. Two of the best-funded labs on earth are in a footrace — and footraces don't "level out." Whoever leads this month gets passed next month, on purpose, by design. If your strategy depends on the dust settling, your strategy is to wait forever.
The good news: you don't have to wait, because the thing everyone's racing over has quietly stopped being the thing that matters.
The models have already converged
The frontier used to be a cliff. Now it's a crowd. On the public benchmarks everyone quotes, the top models are bunched so tightly the differences are almost noise:
The gaps that make headlines live at the fringe — exotic multi-step reasoning, training-time tricks, inference cost at enormous scale. That's real, hard research. But almost none of it touches the work a normal business needs done. Summarize this. Draft that. Write this function. Answer that ticket. Pull the numbers from this PDF. Every serious model clears that bar today — and has for a while.
What actually decides whether AI works for you
If the model were the deciding factor, the companies buying the best models would be winning. They're not. The largest study of enterprise AI to date found the opposite:
None of the 5% that worked got there by picking a smarter model. They got there by how they wrapped the model into real work — the integration, the workflow, the checks around it. The failure mode is almost never "we chose the wrong AI." It's "we never built the system around it." (More on that system: Harness Engineering →)
So how should you actually choose?
Not from a podcast, and not from a benchmark tweet. Five questions that matter far more than this week's rankings:
Point two models at your real codebase or tickets for a week. Measure what finished, how much rework each needed, where each face-planted. Your work is the only benchmark that counts.
Where do your prompts, code, and data go? What's the retention? Is there a BAA? For healthcare and finance, this should gate the decision harder than any score.
How it fits your stack and your team's day-to-day outlasts a benchmark delta. The model people actually enjoy using is the one that gets used.
Keep the model a swappable part. Running two is cheap next to one engineer's hour — and trivial next to the cost of betting the company on the wrong one.
Price the actual workload you'll run, not the headline token rate. The "cheaper" model can lose once you count retries and the work it can't finish.
Where we land
We keep this page neutral on purpose — the framework above matters more than anyone's favorite. But people always ask where we land, so here it is, honestly: for most teams today, we lean Claude. Two reasons, and only one of them is ours:
In the enterprise, Anthropic now leads LLM API usage at 40% (OpenAI 27%, Google 21%). And in coding specifically — the use case that started this whole debate — the gap is wider: roughly 54% Anthropic vs. 21% OpenAI. That's not our opinion; it's where buyers with budgets ended up. (Menlo Ventures, State of Generative AI in the Enterprise, 2025)
Across the real systems we build and ship, Claude has been the steadiest default — especially as a coding agent. Your mileage may vary, which is exactly why question #1 exists.
But hear the whole point of the page: that's a lean, not a religion. If Codex wins on your tasks next quarter — test it and let it. A well-built setup lets you switch the model underneath in an afternoon. The recommendation is "start here," not "marry this."
The thing that doesn't change
Here's the reframe to send your team: the model is the engine; the value is the car you build around it. Pick a capable model — almost any frontier one will do — then put your real effort into the system: the context it sees, the checks it has to pass, the guardrails, the workflow. That's the part that compounds, and it survives every model swap and every leaderboard flip. Chase that, not the rankings.
Stop shopping for the winner. Build so it doesn't matter.
For the work most companies do, the frontier models are interchangeable enough that the choice isn't your bottleneck — the system around them is. Pick one that works, keep it swappable, and invest where the leverage actually lives.
Assess your team's stage →