ARC-AGI-2 2025: From Sub‑10% to 75%+ with the Poetiq Harness on GPT‑5.2

At the start of 2025 the ARC‑AGI‑2 benchmark was a humbling reminder: even the newest LLMs were stuck under 10% on the hardest variants. That gap wasn't a data problem; it was a symptom of how we were using the models. By year‑end the picture flipped, not because a new model arrived, but because we wrapped the same model in a smarter orchestration layer.

Core baselines – solid, but below human level

The strongest standalone offering is GPT‑5.2 X‑High. In its raw “thinking” mode it reaches roughly 53% on ARC‑AGI‑2. That is the best you can get from a pure inference pass. Competing models sit lower: GPT‑5.2 High at 40%, Gemini 3 Pro at about 32%. All fall short of the ~60% human average that the benchmark defines as the baseline for competent reasoning.

Running these models without any external guidance quickly becomes expensive. High‑end variants can cost $20+ per task while still failing to clear the human bar. The limitation is not capacity; it is the lack of a systematic loop that can test, refine, and validate its own output.

Core vs Poetiq performance

The Poetiq harness lifts the same model from 53% to 75%.

Poetiq’s meta‑system – the game changer

Poetiq is not a new model; it is a manager that runs the model through iterative code‑generation loops, self‑audits each result, and feeds the refined output back into the next pass. In practice this means the model can decompose a complex puzzle, write test code, execute it, compare results, and correct mistakes without human intervention.
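Poetiq’s internals are not public, so the details are an assumption, but the generate–execute–audit pattern described above can be sketched in a few lines. In this toy version, a fixed list of candidate programs stands in for real model calls, and the “task” is a pair of training examples whose hidden rule is doubling; every name here (`propose`, `solve`, `CANDIDATES`) is hypothetical, not Poetiq’s API:

```python
from dataclasses import dataclass

@dataclass
class Example:
    input: list
    output: list

# Toy task: the hidden rule is "double every element".
train = [Example([1, 2], [2, 4]), Example([3], [6])]
test_input = [5, 7]

# Stand-in for the LLM: proposes one candidate program per round.
# A real harness would call the model API with the task and feedback.
CANDIDATES = [
    lambda xs: xs,                   # round 0: identity (wrong)
    lambda xs: [x + 1 for x in xs],  # round 1: increment (wrong)
    lambda xs: [2 * x for x in xs],  # round 2: double (correct)
]

def propose(round_idx, feedback):
    return CANDIDATES[round_idx]

def solve(train, test_input, max_rounds=3):
    """Generate-execute-audit loop: propose a candidate program,
    run it on every training pair, and only answer the test input
    once the candidate reproduces all training outputs."""
    feedback = None
    for r in range(max_rounds):
        program = propose(r, feedback)
        failures = [(ex.input, program(ex.input), ex.output)
                    for ex in train if program(ex.input) != ex.output]
        if not failures:        # self-audit: all training pairs match
            return program(test_input)
        feedback = failures     # fed back into the next generation pass
    return None                 # budget exhausted without a valid program

print(solve(train, test_input))  # → [10, 14]
```

The key design point is that validation is mechanical: the harness never trusts the model’s answer directly, only a program whose behavior it has checked against the training pairs.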

That process alone explains the jump from ~53% to ~75%. The harness turns a decent reasoner into a reliable solver that consistently beats the human baseline on ARC‑AGI‑2.

Cost efficiency – more output for less spend

Because Poetiq extracts more from cheaper base models, the per‑task cost settles around $8–$10. By contrast, brute‑forcing a comparable score with a high‑end core model alone can exceed $20 per task.

Why orchestration matters more than raw scale

The data makes a simple point: the ceiling on current LLMs is not the number of parameters but the way we structure their execution. A model that can call tools, run code, and audit its own reasoning extracts far more value than a larger model that merely generates text.

This insight mirrors what we observed with GPT‑5.2‑Codex. There the agentic wrapper, not the model itself, delivered the headline‑making performance.

Takeaways for practitioners

  • Invest in meta‑systems that can decompose, iterate, and self‑audit. The payoff is a vertical jump in benchmark scores.
  • Don’t assume bigger models automatically beat human baselines. Orchestration can achieve the same or better results at lower cost.
  • Benchmark progress should be measured both in raw scores and in cost‑efficiency. Poetiq’s $8‑$10 per task is a compelling metric.

In short, the most interesting chart of 2025 isn’t a new model ranking; it’s the same model’s dot soaring upward once you plug it into a well‑designed harness. That’s the direction the field is moving in, and it’s a shift worth watching.