At the start of 2025 the ARC‑AGI‑2 benchmark was a humbling reminder: even the newest LLMs were stuck under 10 % on the hardest variants. That gap wasn’t a matter of missing data or model capacity; it was a symptom of how we were using the models. By year‑end the picture flipped, not because a new model arrived, but because we wrapped the same model in a smarter orchestration layer.
Core baselines – solid but below the human bar
The strongest standalone offering is GPT‑5.2 X‑High. In its raw “thinking” mode it reaches roughly 53 % on ARC‑AGI‑2. That is the best you can get from a pure inference pass. Competing models sit lower: GPT‑5.2 High at 40 %, Gemini 3 Pro at about 32 %. All fall shy of the ~60 % human average that the benchmark defines as the baseline for competent reasoning.
Running these models without any external guidance quickly becomes expensive. High‑end variants can cost $20+ per task while still failing to clear the human bar. The limitation is not capacity; it is the lack of a systematic loop that can test, refine, and validate the model’s own output.
The Poetiq harness lifts the same model from 53 % to 75 %.
Poetiq’s meta‑system – the game changer
Poetiq is not a new model; it is a manager that runs the model through iterative code‑generation loops, self‑audits each result, and feeds the refined output back into the next pass. In practice this means the model can decompose a complex puzzle, write test code, execute it, compare results, and correct mistakes without human intervention.
That process alone explains the jump from ~53 % to ~75 %. The harness turns a decent reasoner into a reliable solver that consistently beats the human baseline on ARC‑AGI‑2.
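Poetiq’s internals are not public, so the loop above can only be sketched. The following is a minimal, illustrative version of a generate-test-refine harness: the model proposes candidate code, the harness executes it against training examples, and failures are fed back as context for the next pass. The `stub_model` function is a stand-in for a real LLM call, and all names here are assumptions, not Poetiq’s actual API.

```python
def stub_model(prompt, feedback):
    """Stand-in for an LLM call. Hypothetical: a real harness would query
    the model with the puzzle plus any failure feedback from earlier passes."""
    if feedback is None:
        return "def transform(grid): return grid"       # first guess: identity
    return "def transform(grid): return grid[::-1]"     # revised guess: reverse

def run_candidate(source, examples):
    """Execute generated code against training pairs; return the failures."""
    namespace = {}
    exec(source, namespace)
    transform = namespace["transform"]
    return [(x, y) for x, y in examples if transform(x) != y]

def solve(examples, max_iters=5):
    """Generate-test-refine loop: keep revising until the self-audit passes."""
    feedback = None
    for _ in range(max_iters):
        source = stub_model("solve this puzzle", feedback)
        failures = run_candidate(source, examples)
        if not failures:                    # self-audit passed on all pairs
            return source
        feedback = f"{len(failures)} training pairs failed"
    return None

# Toy "puzzle": the hidden rule is list reversal.
train = [([1, 2, 3], [3, 2, 1]), ([4, 5], [5, 4])]
solution = solve(train)
```

The design point is that validation is mechanical: candidate programs are checked against ground-truth examples by execution, not by asking the model whether its own answer looks right.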
Cost efficiency – more output for less spend
Because Poetiq can get the most out of cheaper base models, the per‑task cost settles around $8‑$10. By contrast, brute‑forcing a comparable score with a high‑end core model can exceed $20 per task.
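Using the figures above (with assumed midpoints for the quoted ranges), the efficiency gap is easy to sanity-check as score points per dollar; the numbers are illustrative, not measured:

```python
# Back-of-envelope cost efficiency from the figures quoted above.
# $9 is an assumed midpoint of the $8-$10 range; $20 is the low end
# of the raw high-end cost.
configs = {
    "poetiq_harness": {"score": 75, "cost": 9.0},
    "raw_high_end":   {"score": 53, "cost": 20.0},
}

points_per_dollar = {
    name: cfg["score"] / cfg["cost"] for name, cfg in configs.items()
}
# Roughly 8.3 points/$ for the harness vs. 2.65 points/$ raw: about a
# threefold efficiency gap even before the score advantage is counted.
```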
Why orchestration matters more than raw scale
The data makes a simple point: the ceiling on current LLMs is not the number of parameters but the way we structure their execution. A model that can call tools, run code, and audit its own reasoning extracts far more value than a larger model that merely generates text.
This insight mirrors what we observed with GPT‑5.2‑Codex. There the agentic wrapper, not the model itself, delivered the headline‑making performance.
Takeaways for practitioners
- Invest in meta‑systems that can decompose, iterate, and self‑audit. The payoff is a vertical jump in benchmark scores.
- Don’t assume bigger models automatically beat human baselines. Orchestration can achieve the same or better results at lower cost.
- Benchmark progress should be measured both in raw scores and in cost‑efficiency. Poetiq’s $8‑$10 per task is a compelling metric.
In short, the most interesting chart of 2025 isn’t a new model ranking; it’s the same model’s dot soaring upward once you plug it into a well‑designed harness. That’s the direction the field is moving in, and it’s a shift worth watching.