Case Study · Cost Engineering
Free Models, 24/7 Systems
Let me be clear up front: I don't review models. There's a new "king of the leaderboard" every other week, and chasing them is a trap — that's the whole argument of Which AI Should We Use?
But every so often a release makes my point better than I can. GLM-5.2 — an open-weight model out of Z.ai, released under an MIT license — is one of those. This isn't a review. It's a case study in the thing I actually care about: you no longer need frontier prices to run serious, always-on AI systems.
What "free" actually means here
The weights are free to download. The model is not going to run on your laptop — at 753 billion parameters, the raw weights are over 1.5 terabytes, and even a brutally compressed version needs around 200GB of memory. So let's kill that fantasy early.
What open-weight actually buys you is better than a laptop demo:
The open-weight family isn't one giant model — the smaller ones genuinely run on hardware you own, for the cost of electricity. For monitoring, classification, and routing tasks, that's often all you need.
Because anyone can host the weights, hosting providers compete. That competition is why GLM-5.2 lands at roughly one-third to one-fifth the cost of the proprietary equivalents.
It drops into the agent harnesses you already use — Claude Code, Cursor, Open Code. If you built for swappability, trying it costs you an afternoon, not a migration.
Built for exactly the work that eats tokens, too: a 1-million-token context window and a 128K output limit, designed for long-horizon agent workflows — the expensive stuff.
The numbers (with the caveats attached)
It holds its own against the proprietary frontier on the benchmarks that matter for agent work:
Now the honest part, because benchmarks without caveats are marketing: it trails the frontier models badly on long-horizon knowledge work — on multi-week tasks with thousands of fragmented inputs it scores well behind Claude — and it still fumbles the occasional basic edge case. It is not the best model. That's fine. It doesn't need to be the best model to change your economics.
What it looks like in a real harness
Benchmarks are a sport. What I care about is what happens when you give a model a body — files, a terminal, tests to run. Three examples from hands-on agent work:
Tasked with cloning a 3D game, it built its own to-do list, shipped a first draft that froze, took a screenshot of the bug as feedback, and fixed it. Six prompts to working physics, camera, and jump mechanics.
A fully functional page-summarizer extension, built in under four minutes, with a single screenshot correction to fix the UI. This is throwaway-tool territory — build it, use it, delete it.
Tied to a meeting-notes app and told to improve workflows, it set up a recurring Friday job: scrape the notes, find the operational bottlenecks, then write, test, and install custom tools to fix them. Unattended.
That third one is the future of this whole thing. Not a chatbot you talk to — a system that runs while you sleep, notices what's broken, and builds the fix.
What cheap tokens do to your behavior
Here's the part that actually matters, and it's psychology, not technology. When every API call costs real money, you hesitate. You keep agents on a short leash, trim the context, kill the job early. Expensive tokens make you a rationer.
When tokens are nearly free, you experiment. You hand agents massive context files. You let them run longer, auto-debug their own failures, retry until they get it right. You build the personal daily scripts you'd never justify at frontier prices — the pipeline monitor that runs overnight, the agent that checks your sites and files the fix before you've had coffee, the Friday job that audits your own workflows. A sophisticated, 24/7, always-on system working for you — at a price where you don't have to think about it.
So run the same play I recommend for every model decision: point it at your tasks for a week. If it holds up on the heavy, token-hungry agent work, you just cut that line item by 70–80%. If it doesn't, you spent an afternoon finding out — because you built for swappability. Some platforms will even mirror your production traffic to the cheaper model in parallel and tell you when it's safe to switch. That's the level of rigor this decision deserves: measured, reversible, boring.
Stop rationing. Start building the always-on system.
The gap between open-weight and frontier is closing fast, and the price gap isn't. Keep a frontier model where quality is the bottleneck, put a free one where volume is — and invest the savings in the harness, because that's the part that compounds.
How to actually choose a model →