ARC-AGI-3 Launch: SOTA Models Score Under 1% and the Human Baseline Is Rigged

ARC-AGI-3 launched on March 25, 2026, as an interactive reasoning benchmark. Where ARC-AGI-1 and ARC-AGI-2 gave models static puzzles, ARC-AGI-3 puts agents inside video-game-like environments with over 1,000 levels across 150+ environments. The agent has to perceive, act, plan, and adapt to reach long-horizon goals. It is the first major format shift for the ARC series since 2019.

Current SOTA models score under 1% on preview tasks. The best result from the preview competition was 12.58%, posted by Tufa Labs under the handle StochasticGoose, and that took deliberate, benchmark-specific optimization. Base frontier models out of the box are nowhere close. Meanwhile, the benchmark is designed to be human-winnable. The gap is real and it is large.

How the Scoring Actually Works

The metric is called Relative Human Action Efficiency, or RHAE. For each level completed, the score is calculated as (human baseline actions / AI actions) squared, capped at 1.0. If a human completes a level in 10 actions and an AI takes 20, the score is (10/20)² = 0.25. If the AI takes 100 actions, it drops to (10/100)² = 0.01. Per-level scores then aggregate as a weighted average that overweights harder levels.
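To make the arithmetic concrete, here is a minimal sketch in Python. The per-level formula follows the description above; the aggregation step is an assumption, since ARC Prize describes a weighted average that overweights harder levels without publishing the exact weights, so `weights` below is a placeholder.

```python
def rhae_level_score(human_baseline: int, ai_actions: int) -> float:
    """Per-level RHAE: (human baseline actions / AI actions)^2, capped at 1.0."""
    return min(1.0, (human_baseline / ai_actions) ** 2)

def rhae_aggregate(level_scores: list[float], weights: list[float]) -> float:
    """Weighted average across levels. The real weighting overweights harder
    levels, but the exact weights are not public; these are placeholders."""
    return sum(s * w for s, w in zip(level_scores, weights)) / sum(weights)

# With a human baseline of 10 actions:
print(rhae_level_score(10, 10))   # 1.0  -- matches the baseline
print(rhae_level_score(10, 20))   # 0.25 -- twice the actions, a quarter the score
print(rhae_level_score(10, 100))  # 0.01 -- ten times the actions
```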

The human baseline per level is set to the second-best human run from a group of first-time players, with outliers removed. The stated reasoning is that it represents proficient but achievable human performance, not a world record. That framing is doing a lot of work.
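As a sketch of what that baseline selection looks like, assuming a standard 1.5x IQR rule for the outlier removal (ARC Prize says only "outliers removed" without specifying how, so the filter here is an assumption):

```python
import statistics

def human_baseline(action_counts: list[int]) -> int:
    """Second-best (second-lowest) run after dropping outliers.

    The 1.5x IQR filter is an illustrative assumption; the documented
    rule says outliers are removed, not how.
    """
    q1, _, q3 = statistics.quantiles(action_counts, n=4)
    iqr = q3 - q1
    kept = sorted(r for r in action_counts
                  if q1 - 1.5 * iqr <= r <= q3 + 1.5 * iqr)
    return kept[1]  # second-lowest action count among the remaining runs

print(human_baseline([12, 14, 15, 16, 18, 19, 22, 25, 31, 90]))  # 14
```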

[Figure: RHAE score as a function of AI action count, with the human baseline fixed at 10 actions]

The quadratic penalty is steep. An AI that takes twice as many actions as the second-best human scores 25%, not 50%. That matters because the baseline is not the average human. It is the near-top performer from a controlled group of first-time players. Most people who attempt these levels will score below 100% by this metric. That is the part ARC Prize glosses over when it says human performance is 100%.

The Problem With Calling It 100% Human Performance

The official framing is that human performance is 100%. That claim is technically true only for the second-best human in the study group. In prior ARC human studies, individual tasks were solved by anywhere from 2 to 10 of the non-expert participants who attempted them; not everyone succeeded at everything. The average person with general intelligence who tries these games will not hit the baseline. They will score below it.

Saying humans score 100% while AI scores under 1% is designed to make the gap sound like a clean measurement of something meaningful. It is not. It is a comparison between current AI and the near-top of a selected human cohort, reframed as if the average person walks into these games and breezes through them. The average person does not. This is not a small methodological footnote. It changes how you read the headline numbers entirely.

If the baseline were set to the median first-time human participant, the gap would still be real and still be large, but the framing would be more honest. You can build a rigorous benchmark and still be honest about what the numbers actually represent.
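A toy comparison makes the difference visible. The action counts here are invented for illustration, not taken from the ARC study:

```python
import statistics

# Hypothetical first-time runs on one level (invented numbers)
runs = sorted([12, 14, 15, 16, 18, 19, 22, 25, 31, 40])
second_best = runs[1]                 # 14 actions, the published-style baseline
median_run = statistics.median(runs)  # 18.5 actions, a median baseline

ai_actions = 30
print(round((second_best / ai_actions) ** 2, 2))  # 0.22 vs. the second-best baseline
print(round((median_run / ai_actions) ** 2, 2))   # 0.38 vs. a median baseline
```

Same AI run, same level, nearly double the score depending on which human you call 100%.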

The AGI Naming Problem

There is a broader issue with putting AGI in the benchmark name. ARC-AGI-1 got saturated, so they released ARC-AGI-2. On ARC-AGI-2, top Kaggle scores sat around 24% while best human performance was listed at 100%, even though average human solve rates on tested tasks implied something closer to 53% or lower. Now ARC-AGI-3 is out. When a model eventually closes the gap on ARC-AGI-3, there will presumably be an ARC-AGI-4.

The argument for iterating is reasonable on its face. If a benchmark saturates, it no longer tells you anything useful about the frontier, so you raise the bar. That is a fine practice for a benchmark series. But calling it an AGI benchmark while building in a moving goalpost is a contradiction. If the bar moves every time it gets cleared, you are not measuring AGI. You are maintaining a leaderboard that AI has never technically won and, by design, never will.

ARC Prize frames this as testing fluid intelligence and skill acquisition rather than recall or pattern matching baked in from training data. That framing is legitimate, and the interactive format genuinely resists brute-force scaling in ways that static benchmarks do not. The design philosophy is sound. The marketing around it is not.

What ARC-AGI-3 Actually Tests

The shift from static tasks to interactive environments is the real story here. Prior versions gave models a grid puzzle and asked for the output. ARC-AGI-3 puts agents inside an environment where they have to explore, figure out the rules through action, and then execute toward a goal. Players spend actions on exploration, learning how the environment works, and on execution, actually completing the objective. The RHAE metric counts both against the same budget and compares humans and agents directly.
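In code, the shape of the task looks roughly like the loop below. The `Env` interface is a hypothetical, gym-style stand-in for illustration; it is not the actual ARC-AGI-3 agent API.

```python
from typing import Any, Callable, Protocol

class Env(Protocol):
    """Hypothetical interface, assumed for illustration only."""
    def reset(self) -> Any: ...
    def step(self, action: int) -> tuple[Any, bool]: ...  # (observation, level_done)

def run_level(env: Env, policy: Callable[[Any], int], max_actions: int = 1000) -> int:
    """Play one level; returns actions used. Every exploratory move counts
    against the RHAE denominator exactly like an execution move."""
    obs = env.reset()
    for actions_used in range(1, max_actions + 1):
        obs, done = env.step(policy(obs))
        if done:
            return actions_used
    return max_actions  # ran out of budget without finishing
```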

The benchmark explicitly excludes language and trivia dependence. It tests core priors like object permanence, goal-directedness, and spatial reasoning. That matters because you cannot simply throw a larger language model at it and expect the score to improve. The current results bear this out: frontier models from Anthropic, Google DeepMind, OpenAI, and xAI all report ARC scores on their model cards, and none of them are close on ARC-AGI-3.

The approach that showed the most promise in the preview was reinforcement learning-style agents that learn through interaction rather than models trying to reason through tasks in a single pass. That tracks with what the benchmark is designed to measure. For a look at how coding-focused benchmarks handle similar agent evaluation problems, CursorBench-3 takes a comparable approach to evaluating real task completion rather than static outputs.
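For readers unfamiliar with the distinction, the generic shape of such an agent is a loop that updates a value estimate after every interaction rather than producing one answer in a single pass. The tabular Q-learning sketch below is illustrative only; it assumes the same hypothetical `env.reset()`/`env.step()` interface as above (here returning a reward as well) and says nothing about how the StochasticGoose entry actually works.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions: int, episodes: int = 500,
               alpha: float = 0.1, gamma: float = 0.99, eps: float = 0.1):
    """Generic tabular Q-learning: learn action values from interaction.

    Assumes a hypothetical env where step(a) returns (next_state, reward, done).
    """
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly exploit current estimates, sometimes explore
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)
            # update toward the observed reward plus discounted future value
            Q[state][action] += alpha * (reward + gamma * max(Q[next_state])
                                         - Q[state][action])
            state = next_state
    return Q
```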

Where This Leaves the Benchmark

ARC-AGI-3 is a harder and more interesting test than its predecessors. The interactive format is a meaningful step forward. The gap between current AI and the human baseline is real, and the benchmark is designed in a way that makes gaming it harder than most leaderboard benchmarks. Those are genuine strengths worth acknowledging.

The scoring methodology is defensible if you understand what it is actually measuring. The problem is the way it is communicated. Saying humans score 100% when you mean the second-best human from a controlled study is the kind of framing that makes benchmark skeptics roll their eyes, and they are not wrong to. The disingenuous copy does real damage to the credibility of an otherwise well-designed evaluation.

If ARC-AGI-3 gets saturated in a year or two, they will release ARC-AGI-4. That is fine as a benchmarking strategy. Just stop calling it an AGI benchmark if the definition of passing keeps moving every time something gets close to passing it.

