Case Study · Cost Engineering

Free Models, 24/7 Systems

Let me be clear up front: I don't review models. There's a new "king of the leaderboard" every other week, and chasing them is a trap — that's the whole argument of Which AI Should We Use?

But every so often a release makes my point better than I can. GLM-5.2 — an open-weight model out of Z.ai, released under an MIT license — is one of those. This isn't a review. It's a case study in the thing I actually care about: you no longer need frontier prices to run serious, always-on AI systems.

What "free" actually means here

The weights are free to download. The model is not going to run on your laptop — at 753 billion parameters, the raw weights are over 1.5 terabytes, and even a brutally compressed version needs around 200GB of memory. So let's kill that fantasy early.

What open-weight actually buys you is better than a laptop demo:

💻

Small models run on your metal

The open-weight family isn't one giant model — the smaller ones genuinely run on hardware you own, for the cost of electricity. For monitoring, classification, and routing tasks, that's often all you need.

🏷️

Big models get commodity pricing

Because anyone can host the weights, hosting providers compete. That competition is why GLM-5.2 lands at roughly one-third to one-fifth the cost of the proprietary equivalents.

🔌

Same harness, different engine

It drops into the agent harnesses you already use — Claude Code, Cursor, Open Code. If you built for swappability, trying it costs you an afternoon, not a migration.

Built for exactly the work that eats tokens, too: a 1-million-token context window and a 128K output limit, designed for long-horizon agent workflows — the expensive stuff.

The numbers (with the caveats attached)

It holds its own against the proprietary frontier on the benchmarks that matter for agent work:

62.1

SWE-bench Pro score — ahead of GPT-5.5's 58.6 on real-world coding tasks

Published benchmarks, 2026

$0.17

per vulnerability found in Semgrep's security-audit evaluation, on a basic prompt harness

Semgrep IDOR study, 2026

⅓–⅕

the price of Western proprietary equivalents — cheap enough to stop rationing

Hosted API pricing, 2026

token context window for long-horizon agent workflows

Z.ai, GLM-5.2 release

Now the honest part, because benchmarks without caveats are marketing: it trails the frontier models badly on long-horizon knowledge work — on multi-week tasks with thousands of fragmented inputs it scores well behind Claude — and it still fumbles the occasional basic edge case. It is not the best model. That's fine. It doesn't need to be the best model to change your economics.

What it looks like in a real harness

Benchmarks are a sport. What I care about is what happens when you give a model a body — files, a terminal, tests to run. Three examples from hands-on agent work:

🎮

A playable 3D game in six prompts

Tasked with cloning a 3D game, it built its own to-do list, shipped a first draft that froze, took a screenshot of the bug as feedback, and fixed it. Six prompts to working physics, camera, and jump mechanics.

🧩

A Chrome extension in four minutes

A fully functional page-summarizer extension, built in under four minutes, with a single screenshot correction to fix the UI. This is throwaway-tool territory — build it, use it, delete it.

🔁

Automations that improve themselves

Tied to a meeting-notes app and told to improve workflows, it set up a recurring Friday job: scrape the notes, find the operational bottlenecks, then write, test, and install custom tools to fix them. Unattended.

That third one is the future of this whole thing. Not a chatbot you talk to — a system that runs while you sleep, notices what's broken, and builds the fix.

What cheap tokens do to your behavior

Here's the part that actually matters, and it's psychology, not technology. When every API call costs real money, you hesitate. You keep agents on a short leash, trim the context, kill the job early. Expensive tokens make you a rationer.

When tokens are nearly free, you experiment. You hand agents massive context files. You let them run longer, auto-debug their own failures, retry until they get it right. You build the personal daily scripts you'd never justify at frontier prices — the pipeline monitor that runs overnight, the agent that checks your sites and files the fix before you've had coffee, the Friday job that audits your own workflows. A sophisticated, 24/7, always-on system working for you — at a price where you don't have to think about it.

The lesson is not "switch to GLM-5.2." The lesson is that the model is a swappable engine, and some of the engines are now nearly free. If your setup is well-built, trying one costs an afternoon. If trying one would cost you a quarter, your problem was never the model — it was the harness.

So run the same play I recommend for every model decision: point it at your tasks for a week. If it holds up on the heavy, token-hungry agent work, you just cut that line item by 70–80%. If it doesn't, you spent an afternoon finding out — because you built for swappability. Some platforms will even mirror your production traffic to the cheaper model in parallel and tell you when it's safe to switch. That's the level of rigor this decision deserves: measured, reversible, boring.

Stop rationing. Start building the always-on system.

The gap between open-weight and frontier is closing fast, and the price gap isn't. Keep a frontier model where quality is the bottleneck, put a free one where volume is — and invest the savings in the harness, because that's the part that compounds.

How to actually choose a model →

← Which AI Should We Use?