OpenAI’s o3 model scored 75.7% on the ARC-AGI Semi-Private Evaluation, with a high-compute version reaching 87.5%. These numbers matter because they show real progress in AI’s ability to adapt to new situations.
The ARC-AGI benchmark specifically measures an AI’s ability to handle tasks it hasn’t seen before. Previous models struggled here: o1-preview managed only 21.2%, Claude 3.5 Sonnet hit 21%, and Gemini 1.5 scored 8%.
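To make that concrete, here’s a minimal sketch of what an ARC-style task looks like, based on the JSON format of the public ARC dataset (the semi-private evaluation tasks themselves are hidden). The grids and the row-reversal rule below are invented for illustration; real tasks are considerably harder.

```python
# A toy ARC-style task in the public dataset's JSON shape:
# "train" holds input/output example pairs, "test" holds inputs
# whose outputs the solver must predict. Cells are integers 0-9,
# each mapping to a color. These grids are made up for illustration.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},
    ],
}

def apply_rule(grid):
    """Candidate rule inferred from the training pairs: reverse each row."""
    return [list(reversed(row)) for row in grid]

# A solver only "passes" if its rule reproduces every training output
# and then generalizes to the unseen test input.
for pair in task["train"]:
    assert apply_rule(pair["input"]) == pair["output"]

print(apply_rule(task["test"][0]["input"]))  # [[0, 3], [3, 0]]
```

The point of the format is that the transformation is never stated anywhere: the model has to induce the rule from a couple of examples, which is exactly the kind of few-shot generalization the benchmark scores.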
But let’s be clear about what this means. Despite these impressive scores, o3 still fails at some basic tasks that humans find trivial. This tells us we’re not looking at AGI yet.
I’ve written before about the different levels of AI systems and how we measure their capabilities. You can read more about that here: https://adam.holter.com/the-5-levels-of-ai-agents-from-basic-chatbots-to-self-improving-systems/
The open-source AI community should pay attention to this development. ARC Prize announced they’ll keep running their Grand Prize competition until someone creates an efficient, open-source solution that scores 85% on the latest ARC-AGI. Given the recent trends in open-source AI development (https://adam.holter.com/the-state-of-open-source-ai-in-2024-efficiency-beats-scale/), this sets up an interesting challenge.
My take? This is solid technical progress, but we shouldn’t overstate it. The o3 model shows better reasoning abilities than its predecessors, but it’s still fundamentally different from human intelligence. The real breakthrough will come when we see consistent performance across all types of tasks, not just improved scores on specific benchmarks.