
The AI “Gap” Is Not 7 Months – It’s A Messy Vector

Those “gap is closing” AI charts keep showing up. Two smooth lines, a bold “7 months” label, and suddenly the takeaway is: open models are basically caught up with the frontier.

The problem is simple: most of that convergence is baked into the benchmarks, not revealed by them.

GPQA: Convergence Guaranteed By Construction

Start with the Epoch AI GPQA-Diamond plot that keeps circulating. Time on the x-axis, accuracy from 0% to 100% on the y-axis. Teal dots for frontier models like GPT-4, GPT-4o, o1-mini, Grok 4. Pink dots for smaller open or China-adjacent models that can run on consumer GPUs. Two regression lines, a “7 months” arrow between a teal point and a pink point, and the slogan writes itself: “open is only seven months behind.”

That visual feels persuasive because it borrows the look of a geopolitical chart: US vs China, closed vs open, two lines converging toward parity. The problem is that GPQA is capped at 100%. Once serious labs start training directly for it, the lines have nowhere to go except up and toward each other.

GPQA-Diamond itself is a clever test: 198 graduate-level, Google-resistant science questions written and checked by domain experts. That is exactly the kind of thing you want for early signal on reasoning. But once the strongest labs start aiming straight at those 198 items and that style of question, you are no longer looking at a raw capability probe; you are looking at a scoreboard.

Early on, going from 20% to 40% looks like huge progress. Later, going from 70% to 80% is harder in real terms but visually smaller. Near the ceiling, wringing out a few extra percentage points costs a pile of compute and clever training and barely moves the dot. On a chart like that, convergence is not a deep insight about geopolitics. It is what happens when you near the top of a bounded test.

[Figure: benchmark convergence example]

Two groups racing toward a 100% cap will always look like they are “closing the gap” at the top, even when the cost of each extra point differs wildly between them.

You could redraw the same style of chart as “cheap vs expensive,” “US vs Europe,” or “frontier vs mobile-friendly.” As long as everyone is training hard on a benchmark that is near saturation, every group’s trend line hugs the ceiling and the apparent gap shrinks.

On that plot the gap must close, because the test has a roof.
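One quick way to see the compression: hold a model’s real lead constant in log-odds space and let the bounded scale do the squeezing. A minimal sketch in Python, with made-up growth rates and lead size chosen purely for illustration:

```python
# Two models improving at the same steady pace in log-odds space.
# The underlying lead never changes, but the visible accuracy gap
# shrinks as both lines approach the 100% roof.
import math

def accuracy(logit: float) -> float:
    """Map an unbounded capability score onto a bounded 0-100% scale."""
    return 1 / (1 + math.exp(-logit))

FRONTIER_LEAD = 1.5  # constant lead in log-odds, an illustrative assumption

for step in range(6):
    open_logit = -1.0 + 0.8 * step          # both climb at the same rate
    frontier_logit = open_logit + FRONTIER_LEAD
    gap = accuracy(frontier_logit) - accuracy(open_logit)
    print(f"step {step}: open {accuracy(open_logit):6.1%}  "
          f"frontier {accuracy(frontier_logit):6.1%}  visible gap {gap:6.1%}")
```

Run it and the printed gap collapses from roughly 35 points to under 4, with zero change in the underlying lead. That is the whole trick behind “the lines are converging.”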

METR’s Coding Chart: One Slice, One Scoring Rule

Now layer on the METR chart people quote for long-horizon coding. The headline version goes like this: “Kimi-K2 Thinking scores about the same as Claude 3.7 Sonnet on METR, so open is only 7–9 months behind frontier on long-context software tasks.”

That chart measures the time horizon of coding tasks a model can solve 50% of the time on a curated set. Older models sit at the bottom with short tasks. Newer models climb into multi-hour territory. GPT-5.1-Codex-Max xhigh and relatives are near the top band, Claude Sonnet is below that, and Kimi-K2 Thinking roughly matches Sonnet at that particular success threshold.

From there people jump straight to “open is 7–9 months behind frontier.” Same story shape as GPQA, just a different benchmark.

There are at least two problems here:

  • It is one evaluation, on one task distribution, with one scoring rule: 50% success on this handpicked set of long-horizon coding problems. Change the threshold to 80% or change the task mix and the picture shifts (the sketch below makes that concrete).
  • Equal scores do not mean equal tools. In practice Kimi-K2 Thinking is a much more useful agent for long-horizon work than that score suggests: cheaper, more efficient, and more agent-friendly for workflows that care about cost per completed task. I wrote more about that in Kimi K2 Thinking Aftermath: Great Agent, Mediocre Writer.

METR saying “Kimi-K2 Thinking and Claude Sonnet hit a similar 50% line” does not magically make them interchangeable in real usage. It just says they tied under that metric on that dataset.
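To make that first point concrete, here is a rough sketch of what a “time horizon at a threshold” metric can look like. This is not METR’s actual pipeline; the 30-minute bucket size and the result records are assumptions for illustration:

```python
# Rough sketch of a threshold-based time horizon, NOT METR's real method:
# bucket tasks by human-time length, then report the longest bucket where
# the model still clears the chosen success rate.
from collections import defaultdict

def time_horizon(results: list[tuple[float, bool]], threshold: float) -> float:
    """results: (task_length_minutes, solved) pairs.
    Returns the upper edge, in minutes, of the longest 30-minute bucket
    whose success rate is still >= threshold."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for minutes, solved in results:
        buckets[int(minutes // 30)].append(solved)
    horizon = 0.0
    for bucket in sorted(buckets):
        rate = sum(buckets[bucket]) / len(buckets[bucket])
        if rate >= threshold:
            horizon = (bucket + 1) * 30.0
    return horizon
```

The point is the `threshold` parameter: two models can tie at `time_horizon(results, 0.5)` and split wide apart at `time_horizon(results, 0.8)`, because reliability at the long end is exactly where they differ.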

The Gap Is Not A Number, It Is A Vector

By this point you have two different “gaps” in circulation:

  • GPQA: open or China-tied models look about 7 months behind on a near-saturated QA test.
  • METR: Kimi-K2 Thinking lands about 7–9 months behind the cutting edge on one long-horizon coding benchmark.

People talk about these like they are hard constants, as if there were a fixed 7–9 month lag etched into the laws of physics.

The moment you change what you measure, the story jumps.

Look at hard math instead. FrontierMath is nowhere near saturated. On that kind of eval, Gemini 3 and a few other top models blow past the rest. The gap between the absolute frontier and the cluster of “good but cheaper” models is large again. That picture looks closer to the real strategic separation than a GPQA chart smashed against 100%.

Or look at usage rather than raw scores. In the West, production workloads still skew heavily toward closed cloud models. In China, strong local models plus open training recipes are starting to matter more, especially for teams that care a lot about price per unit work. In coding agents, models like GPT-5.1-Codex-Max xhigh have an obvious edge in raw ability, which is exactly why I wrote about it as a strong agentic coder in GPT-5.1-Codex-Max xhigh: Strong Agentic Coder, Horrible Name.

So the “gap” is not one scalar. It is a vector with components like:

  • Closed vs open training and deployment.
  • US vs China and friends.
  • Cloud API vs local or on-prem inference.
  • Saturated benchmarks vs still-challenging evals.
  • List price vs cost per real task, including failures.

Each axis behaves differently. As benchmarks saturate, the charts along the score axis look tighter, while the experience of actually building with these systems can still feel very far apart.
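If it helps to see the framing in code, here is a purely illustrative sketch; every axis name and every weight here is an assumption, not a standard:

```python
# Purely illustrative: "the gap" as a vector you collapse yourself,
# with weights chosen for YOUR workload, instead of one scalar "months behind".
from dataclasses import dataclass, astuple

@dataclass
class GapVector:
    saturated_benchmarks: float  # near zero by construction
    hard_evals: float            # e.g. frontier math, still wide open
    cost_per_task: float         # can run negative: the "weaker" model wins
    deployment_fit: float        # local/on-prem vs cloud-only

def weighted_gap(gap: GapVector, weights: GapVector) -> float:
    """One scalar only AFTER you commit to weights for your use case."""
    return sum(g * w for g, w in zip(astuple(gap), astuple(weights)))
```

Two teams with identical `GapVector` measurements can rationally land on opposite model choices, because their weights differ. That is the part a single “months behind” number erases.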

Scoreboards, Overfitting, And Fake Comfort

There is another failure mode that almost never shows up in those pretty regression plots: scoreboarding.

Once a benchmark becomes a status symbol, labs start training directly to that scoreboard. They rustle up specialized data, do targeted post-training, tweak model routing, and basically overfit to that style of question. The score climbs. The lines close. Everyone declares victory.

Then you move to a fresh eval distribution that nobody has drilled on yet, and the frontier models pull way ahead again. The scoreboard improvements did not generalize. They just made the saturated plot look “good.”

I see a similar pattern in a lot of self-improving model papers. On paper, they get nice bumps on a chosen benchmark by pouring more compute into a tight loop. In practice the effect is closer to more targeted training on the same scoreboard. I wrote about that pattern in MIT’s SEAL Self-Adapting Language Model: Why Most Self-Improving AI Papers Are Just More Compute.

The GPQA and METR plots sit right on that fault line. They are valuable tools. They are not universal truth oracles.

How To Read “Gap Is Closing” Claims

So what do you do with all this if you are trying to build real systems instead of winning on a chart?

A few practical checks:

  • Check the ceiling. If the benchmark tops out at 100% and most serious models are above roughly 70%, assume the lines will converge visually even if the cost to get those last points is wildly different (see the sketch after this list).
  • Check the slice. A single task family – long-horizon coding, grad-level QA, tool use – gives you a partial ordering, not a full one.
  • Check real tasks. What are teams actually shipping with? Which models stay up under load, behave predictably, and hit your cost targets? This is where a lot of my own work on systems and orchestration matters more than raw scores.
  • Assume some overfitting. Once a leaderboard exists, training to the leaderboard exists. Fresh tasks will often reopen the gap.
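The first check is mechanical enough to automate. A hedged sketch, where the 70% cutoff is just the rule of thumb from the list and the scores are invented:

```python
# Flag near-saturated benchmarks, then compare models in log-odds space,
# where the 100% roof stops compressing differences. Scores are invented.
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def ceiling_report(scores: dict[str, float]) -> None:
    if min(scores.values()) > 0.7:  # rough rule of thumb, not a law
        print("warning: near-saturated benchmark, visual convergence expected")
    ranked = sorted(scores, key=scores.get, reverse=True)
    for a, b in zip(ranked, ranked[1:]):
        raw = scores[a] - scores[b]
        real = logit(scores[a]) - logit(scores[b])
        print(f"{a} vs {b}: raw gap {raw:.1%}, log-odds gap {real:.2f}")

ceiling_report({"frontier": 0.92, "open": 0.85})
# raw gap 7.0%, but the log-odds gap (~0.71) is what the ceiling was hiding
```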

Underneath the charts, you are still picking which mistakes you want and what you are willing to pay for them. I wrote about that tradeoff more generally in AI Errors vs Human Errors: You’re Choosing Which Mistakes You Want. Benchmarks do not remove that choice. They just compress it into a handful of scores.

What This Means When You Are Picking Models

If you are deciding between a closed frontier model and an open or regional one, the right move is not to memorize a single “months behind” number. It is to decide which axes of the gap vector actually matter for your situation.

  • If you care most about frontier math, safety margins, or bleeding-edge agent behavior, you probably want the strongest closed model you can afford, even if the GPQA chart says the gap looks small.
  • If you care most about cost per workflow, data privacy, or running close to the user on local hardware, an open or China-adjacent model that is “behind” on a saturated benchmark may still be the dominant choice.
  • If you are building agents, think about tools and routing first. A slightly weaker base model with better tool use, better planning, and saner pricing can beat a nominally stronger model in overall throughput (the sketch after this list shows the arithmetic). That is the whole reason I built my own orchestration stack and wrote pieces like When Does a Chatbot Become an Agent? Chat Interface vs AI Autonomy and AI Dashboard Update: A Central Hub for Artificial Analysis, OpenRouter, fal and More.
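The arithmetic behind that throughput claim, with invented prices and success rates; the key is that failed attempts still bill you and have to be retried:

```python
# Expected cost per COMPLETED task: each attempt bills you whether it
# succeeds or not, so expected attempts per completion is 1 / success_rate.
def cost_per_completed(price_per_attempt: float, success_rate: float) -> float:
    return price_per_attempt / success_rate

strong = cost_per_completed(price_per_attempt=2.00, success_rate=0.90)
cheap = cost_per_completed(price_per_attempt=0.40, success_rate=0.75)
print(f"strong: ${strong:.2f}/task, cheap: ${cheap:.2f}/task")
# strong: $2.22/task, cheap: $0.53/task -- the nominally weaker model
# completes roughly four tasks for the price of one.
```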

Benchmarks are still useful as rough filters. Just do not turn them into numerology about “the” gap.

The neat “open is only 7 months behind” narrative is mainly a story about how those particular tests behave near their ceiling. Once you look at harder evals like FrontierMath, actual adoption, and full workflows instead of single numbers, the picture is much messier.

And that is the real point: the AI gap is not one number on a chart. It is a moving vector. Any argument that fits inside two regression lines and a “7 months” label is mostly telling you about the benchmark, not the models.