A Cheaper Model Showed Up. We Ran the Flywheel Before Trusting It.

Google shipped a cheaper, faster extraction model. Instead of swapping it in and hoping, we put it through Headkey's flywheel. The result: same quality, 58% less cost — and one gotcha that would have poisoned the whole experiment.

The temptation

A new model lands. It's cheaper, it's faster, the benchmarks look great. The temptation is to change one line of config, watch the bill drop, and move on. The risk is that you've quietly degraded the most important step in your pipeline and won't notice until your users do.

We hit exactly this moment with Headkey. Google's gemini-3.1-flash-lite advertised a fraction of the price and latency of the model we use for extraction. The right question isn't "is it cheaper?" — it obviously is. The right question is "is it cheaper at the same quality, and how would we know?"

This is the story of answering that with data instead of vibes. If you've read our earlier post on flywheel-driven development, this is that idea applied to a real decision.

What "extraction" is, and why the model matters

Headkey is a memory layer for AI agents. When an agent learns something — "the auth service in repo A uses 15-minute JWTs" — Headkey has to turn that sentence into structured knowledge: entities, relationships, and belief proposals (subject–predicate–object tuples) it can store, reinforce, and contradict later.
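
To make the target shape concrete, here is a minimal sketch of what one extracted record could look like for the example sentence. The class names, field names, and the hand-written values are illustrative assumptions, not Headkey's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BeliefProposal:
    """One subject-predicate-object tuple (hypothetical shape)."""
    subject: str
    predicate: str
    object: str


@dataclass
class ExtractionResult:
    """Structured output for one ingested sentence (field names are assumptions)."""
    entities: list[str] = field(default_factory=list)
    relationships: list[tuple[str, str, str]] = field(default_factory=list)
    beliefs: list[BeliefProposal] = field(default_factory=list)


# Hand-written illustration of what extraction might return for
# "the auth service in repo A uses 15-minute JWTs".
result = ExtractionResult(
    entities=["auth service", "repo A", "JWT"],
    relationships=[("auth service", "part_of", "repo A")],
    beliefs=[
        BeliefProposal(
            subject="repo A auth service",
            predicate="uses",
            object="15-minute JWTs",
        )
    ],
)

for belief in result.beliefs:
    print(f"({belief.subject}) -[{belief.predicate}]-> ({belief.object})")
```

Keeping beliefs as small immutable tuples is one way to make later steps (reinforcing or contradicting a stored belief) a matter of comparing fields rather than re-reading text.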

That conversion is the extraction step. It's an LLM call that runs on every piece of information the agent ingests. It's the highest-volume, highest-leverage model call in the whole system. If extraction gets sloppy, everything downstream — resolution, belief formation, recall — inherits the sloppiness. So this is precisely the call site where "trust the benchmark" is not good enough.

The rule: don't swap on vibes

We keep a flywheel for exactly this: a harness that runs the real extraction call against a suite of hand-built test cases, checks the structured output against assertions, and records tokens, cost, and latency for every call. No mocks. Real model, real prompt, real schema.

The suite is 46 cases across eight families — categorization, date handling, subject-and-predicate splitting, qualifier extraction, refusals, and the genuinely hard stuff like disambiguating two things that share a name. Each case has machine-checkable assertions, so "did it get better or worse" is a number, not an opinion.
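
As a rough illustration of "machine-checkable," a test case can be a sentence plus a list of predicates over the structured output. Everything named here (`Case`, `run_case`, the output keys) is hypothetical, and `stub_extract` is an offline stand-in so the example runs; the harness described above calls the real model instead.

```python
import re
from dataclasses import dataclass
from typing import Callable

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


@dataclass
class Case:
    name: str
    family: str
    text: str
    assertions: list[Callable[[dict], bool]]


def run_case(case: Case, extract: Callable[[str], dict]) -> dict:
    """Run one case and report which assertions (by index) failed."""
    output = extract(case.text)
    failed = [i for i, check in enumerate(case.assertions) if not check(output)]
    return {"case": case.name, "passed": not failed, "failed_assertions": failed}


date_case = Case(
    name="date-slash-format",  # hypothetical case name
    family="date handling",
    text="The contract was signed on 2023/02/20.",
    assertions=[
        # Exactly one belief should come back.
        lambda out: len(out.get("beliefs", [])) == 1,
        # Every belief object must be an ISO date (YYYY-MM-DD).
        lambda out: all(
            ISO_DATE.fullmatch(b.get("object", "")) for b in out.get("beliefs", [])
        ),
    ],
)


def stub_extract(text: str) -> dict:
    # Stand-in for the real model call, used ONLY so this sketch runs offline.
    return {
        "beliefs": [
            {"subject": "contract", "predicate": "signed on", "object": "2023-02-20"}
        ]
    }


verdict = run_case(date_case, stub_extract)
print(verdict)
```

Because each assertion is a plain predicate, a failing case reports which check broke, not just that something did.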

The plan was simple: run the incumbent, run the challenger, change nothing else, and diff.

Making the swap a one-line change

Here's a practical detail worth stealing. We didn't write a Gemini client. Gemini exposes an OpenAI-compatible endpoint, so the exact same SDK, the exact same request — including the strict JSON-schema response format we rely on — points at Google's servers by changing a base URL and an API key.

We added a small seam: extraction can use a separate provider from the rest of the pipeline, controlled by one environment variable. Off by default, zero behavior change when unset. With it set, extraction runs on Gemini while everything else stays put. That isolation is what makes a clean A/B possible — you're swapping one call site, not the whole system.
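
A seam like this can be very small. The sketch below assumes a hypothetical `EXTRACTION_PROVIDER` variable and a hypothetical provider table; the model names come from this post, while the variable names, key names, and table layout are made up for illustration. The Gemini base URL is Google's documented OpenAI-compatible endpoint as far as we know, but it is worth checking against the current docs before relying on it.

```python
import os
from typing import Mapping, Optional

# Hypothetical provider table (layout and env var names are assumptions).
PROVIDERS = {
    "default": {
        "base_url": None,  # None = let the SDK use its built-in endpoint
        "api_key_env": "OPENAI_API_KEY",
        "model": "gpt-5.4-nano",
    },
    "gemini": {
        # OpenAI-compatible endpoint; verify against current Google docs.
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
        "api_key_env": "GEMINI_API_KEY",
        "model": "gemini-3.1-flash-lite",
    },
}


def resolve_extraction_provider(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick the provider for the extraction call site only.

    Unset or empty variable means no behavior change: the default is used.
    """
    env = os.environ if env is None else env
    name = env.get("EXTRACTION_PROVIDER", "").strip().lower()
    if not name:
        return dict(PROVIDERS["default"])
    if name not in PROVIDERS:
        raise ValueError(f"unknown extraction provider: {name!r}")
    return dict(PROVIDERS[name])


# Passing an explicit mapping keeps the example independent of the real environment.
print(resolve_extraction_provider({})["model"])
print(resolve_extraction_provider({"EXTRACTION_PROVIDER": "gemini"})["model"])
```

The resolved base URL and key would then be handed to the same SDK client constructor used everywhere else, which is what keeps the request itself unchanged between the two runs.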

The pre-flight that saved the experiment

Before running a single suite, we ran a one-call smoke test. The reason: our extraction request demands strict JSON-schema output over a large schema. OpenAI-compatible shims don't all honor strict faithfully — and a shim that silently ignores it would return slightly-wrong JSON that looks like a quality regression when it's really a protocol mismatch. You'd spend a day blaming the model for a plumbing bug.

The smoke test confirmed Gemini honored the contract exactly — same schema, same shape, valid every time. Only then did the full run earn any trust.
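
A smoke test of this kind boils down to "parse one response and check it against the schema you asked for." The sketch below is a toy checker covering only a small subset of JSON Schema (objects, arrays, strings, required keys, no extra keys), and `BELIEF_SCHEMA` is a hypothetical schema far smaller than a real one; a full validator library would be the safer choice in practice.

```python
import json


def check_strict(value, schema: dict, path: str = "$") -> list[str]:
    """Return a list of violations; an empty list means the value conforms."""
    problems: list[str] = []
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required key")
        if schema.get("additionalProperties") is False:
            for key in value:
                if key not in props:
                    problems.append(f"{path}.{key}: unexpected key")
        for key, sub in props.items():
            if key in value:
                problems += check_strict(value[key], sub, f"{path}.{key}")
    elif kind == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        for i, item in enumerate(value):
            problems += check_strict(item, schema.get("items", {}), f"{path}[{i}]")
    elif kind == "string":
        if not isinstance(value, str):
            problems.append(f"{path}: expected string")
    return problems


# Hypothetical miniature schema, for illustration only.
BELIEF_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["beliefs"],
    "properties": {
        "beliefs": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": False,
                "required": ["subject", "predicate", "object"],
                "properties": {
                    "subject": {"type": "string"},
                    "predicate": {"type": "string"},
                    "object": {"type": "string"},
                },
            },
        }
    },
}


def smoke_test(raw_response: str, schema: dict) -> list[str]:
    """Parse one raw model response and list every contract violation."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    return check_strict(payload, schema)
```

A response with a renamed key, such as `obj` in place of `object`, yields two violations here (one missing key, one unexpected key), which is exactly the kind of protocol drift that would otherwise be misread as a quality drop.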

The gotcha: "minimal" reasoning made it worse

One detour worth sharing, because it's counterintuitive. Both models can "think" before answering, which costs latency and tokens. Extraction is a throughput path, so we want thinking off. The knob is reasoning_effort.

Reading "use minimal reasoning," the obvious move is to set it to low. That was wrong. On Flash-Lite, low enabled thinking — it tripled latency and actually dropped a belief on a multi-fact sentence. The correct floor was none, which on this model is effectively the default: thinking stays off, full speed, no quality loss. The lesson is the same one the flywheel teaches over and over — measure the knob, don't assume it.
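
"Measure the knob" can be as simple as sweeping the setting over one fixed input and recording latency and belief count for each value. In this sketch `sweep_knob` and `stub_call` are hypothetical names; the stub returns the same output for every setting so the example runs offline, and it implies nothing about how any real model behaves.

```python
import time
from typing import Callable, Optional, Sequence


def sweep_knob(
    call: Callable[[str, Optional[str]], dict],
    text: str,
    settings: Sequence[Optional[str]],
) -> list[dict]:
    """Run the same input once per setting and record what changed."""
    rows = []
    for setting in settings:
        start = time.perf_counter()
        output = call(text, setting)
        elapsed = time.perf_counter() - start
        rows.append(
            {
                "reasoning_effort": setting,  # None = leave the knob unset
                "seconds": elapsed,
                "beliefs": len(output.get("beliefs", [])),
            }
        )
    return rows


def stub_call(text: str, reasoning_effort: Optional[str]) -> dict:
    # Offline stand-in for the real extraction request (assumption for this sketch).
    return {"beliefs": [{"subject": "cache", "predicate": "is", "object": "Redis"}]}


rows = sweep_knob(stub_call, "The cache is Redis.", [None, "none", "low"])
for row in rows:
    print(row)
```

With a real call plugged in, a row where latency jumps or the belief count drops is the signal that a setting does not mean what its name suggests.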

The results

We ran every case three times, at temperature 0, on both models. Three repetitions matter: a single run can get lucky. With three, every case landed at either 100% or 0% pass — zero flaky — which tells us the scores are stable, not noise.
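
The stable-versus-flaky distinction is easy to compute once each case has a list of per-repetition verdicts. A minimal sketch, with hypothetical case names and made-up verdicts:

```python
def classify_runs(runs_by_case: dict[str, list[bool]]) -> dict[str, str]:
    """Label each case by whether its repetitions agree with each other."""
    verdicts = {}
    for case, runs in runs_by_case.items():
        if not runs:
            raise ValueError(f"case {case!r} has no runs")
        if all(runs):
            verdicts[case] = "stable_pass"   # 100% pass
        elif not any(runs):
            verdicts[case] = "stable_fail"   # 0% pass
        else:
            verdicts[case] = "flaky"         # repetitions disagree
    return verdicts


# Made-up data: three repetitions per case.
example = {
    "date-slash-format": [True, True, True],
    "bare-character-name": [False, False, False],
    "hypothetical-flaky-case": [True, False, True],
}
labels = classify_runs(example)
print(labels)
```

A run with zero `flaky` labels is what justifies reading the pass counts as a property of the model rather than of one lucky sample.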

Metric (46 cases × 3 reps) | Incumbent (gpt-5.4-nano) | Challenger (gemini-3.1-flash-lite)
Stable pass / fail         | 43 / 3                   | 43 / 3
Mean pass-rate             | 93%                      | 93%
Total cost                 | $0.1081                  | $0.0456 (−58%)
Wall time                  | 232 s                    | 154 s (−33%)
Per-case cost              | $0.0018–0.0031           | $0.0008–0.0012

Identical quality at roughly 42% of the cost and two-thirds of the wall time. Gemini also posted a lower median and tail latency in every single suite, not just on average.
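
The headline ratios can be rechecked from the published totals. This sketch uses only the figures in the results table; because the wall times shown are rounded to whole seconds, the time saving recomputed here may differ from the table's figure by a point.

```python
CASES, REPS, MODELS = 46, 3, 2

incumbent = {"passed": 43, "cost_usd": 0.1081, "wall_s": 232}
challenger = {"passed": 43, "cost_usd": 0.0456, "wall_s": 154}

total_calls = CASES * REPS * MODELS          # 276 extractions in total
pass_rate = incumbent["passed"] / CASES      # same for both models (43 of 46)
cost_ratio = challenger["cost_usd"] / incumbent["cost_usd"]
time_ratio = challenger["wall_s"] / incumbent["wall_s"]

print(f"total calls:      {total_calls}")
print(f"pass rate:        {pass_rate:.1%}")
print(f"cost ratio:       {cost_ratio:.1%} of incumbent")
print(f"cost reduction:   {1 - cost_ratio:.1%}")
print(f"wall-time ratio:  {time_ratio:.1%} of incumbent")
```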

Same score, different mistakes

Here's the part a single number would hide. Both models failed exactly three cases — but not the same three.

Two failures were shared: hard cases involving bare character names and appositive phrases, a known limitation of this class of model regardless of vendor. Both models stumble identically there, so those cases don't break the tie.

The interesting part is the one failure each model owns alone:

  • The incumbent kept a date as 2023/02/20 instead of normalizing it to ISO 2023-02-20. Gemini normalized dates cleanly — it passed the entire date-handling suite.
  • Gemini took a scoped fact ("in the platform project, the cache is Redis") and put the scope in a dedicated qualifier field instead of baking it into the subject the way our prompt expects. It parsed the sentence correctly — it just filed the context one slot over.

So they're not better-and-worse; they're differently shaped. Each has a soft edge the other doesn't, and both edges are fixable with a prompt tweak rather than a model change. That nuance only shows up because the harness reports per-case verdicts, not just a final tally.
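
With per-case verdicts in hand, "same score, different mistakes" falls out of plain set arithmetic. The case names below are hypothetical labels for the failures described above:

```python
# Hypothetical case names standing in for the failures described in this section.
incumbent_failed = {"bare-character-name", "appositive-phrase", "date-slash-format"}
challenger_failed = {"bare-character-name", "appositive-phrase", "scoped-fact-qualifier"}

shared = incumbent_failed & challenger_failed          # both models fail
only_incumbent = incumbent_failed - challenger_failed  # incumbent's own soft edge
only_challenger = challenger_failed - incumbent_failed # challenger's own soft edge

print("shared:         ", sorted(shared))
print("incumbent only: ", sorted(only_incumbent))
print("challenger only:", sorted(only_challenger))

# Equal failure counts do not imply equal failure sets.
same_score = len(incumbent_failed) == len(challenger_failed)
same_mistakes = incumbent_failed == challenger_failed
```

Here `same_score` is true while `same_mistakes` is false, which is the distinction a single pass-rate number cannot show.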

What this means

Three takeaways we'd hand to anyone running a model pipeline:

  • Build the seam before you need it. Because extraction was already swappable behind one config var, testing a new provider cost an afternoon, not a refactor. The day a cheaper model ships, you want to be one environment variable away from an honest comparison.
  • Pre-flight the contract, then trust the numbers. The strict-schema smoke test is the cheapest insurance you'll ever buy. Validate that the new provider speaks your exact dialect before you read a single quality score.
  • The flywheel makes "should we switch?" a measurement, not an argument. We didn't debate Gemini's merits. We ran 276 real extractions, read the deltas, and the decision made itself.

The honest caveats

This measures the extraction call in isolation — one important step, not the whole pipeline. Before Gemini becomes the production default, we'll run it end-to-end (does swapping it change how well agents actually recall things?) and settle the one place the two models disagree about where scope belongs. Forty-six cases is enough to rank cost and catch regressions, not enough to publish a leaderboard. And the prompt was tuned for the incumbent, so this is the conservative, apples-to-apples read — Gemini's ceiling under a Gemini-tuned prompt is a separate experiment.

But the headline holds: a model that costs about 42% as much and answers in two-thirds the time, at parity on a suite designed to catch exactly the mistakes we care about. That's a switch worth making, and now we can make it with our eyes open.


Headkey is a memory layer for AI agents. The evaluation harness in this post — real calls, per-case verdicts, cost and latency traces — is the same flywheel we use to tune every model-dependent step in the system.

https://headkey.ai