The Artificial Analysis Intelligence Index v3 has just dropped. It’s a big deal, offering the most rigorous, transparent, and standardized evaluation of LLMs to date. The index blends eight text-only benchmarks, covering everything from reasoning and knowledge to coding, plus new agentic workloads via Terminal-Bench Hard and τ²-Bench Telecom. They report a tight ±1% uncertainty across runs, which is impressive for cross-model comparison.
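For intuition, here’s a minimal sketch of what an equal-weighted composite of eight benchmark scores looks like. The equal weighting, the placeholder benchmark names, and every number below are illustrative assumptions, not Artificial Analysis’s published methodology.

```python
# Hypothetical sketch of a composite intelligence index.
# Equal weighting, placeholder names, and all scores are assumptions
# for illustration, not Artificial Analysis's actual methodology.
scores = {
    "terminal_bench_hard": 31.0,  # named in the index; score made up
    "tau2_bench_telecom": 45.0,   # named in the index; score made up
    "benchmark_3": 60.0,          # stand-ins for the other six
    "benchmark_4": 72.0,
    "benchmark_5": 55.0,
    "benchmark_6": 68.0,
    "benchmark_7": 80.0,
    "benchmark_8": 49.0,
}

# Simple equal-weight mean over the eight benchmarks.
index = sum(scores.values()) / len(scores)
print(f"composite index: {index:.1f}")  # -> 57.5
```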
But here’s the thing: no single benchmark tells the whole story. The index is solid, but it’s still text-only and doesn’t capture everything. Models like Claude Sonnet 4, for example, score 44 on the index but are widely recognized as top performers for coding and agentic tasks. The headline score therefore needs to be balanced with a more nuanced understanding of a model’s strengths and weaknesses in real-world applications. Relying on a benchmark score alone is a mistake; I’ve seen this play out many times, with lower-scoring models outperforming in specialized contexts.
The entire model landscape feels like it’s split down the middle right now: closed models from OpenAI, Anthropic, Google, and xAI on one side, and a rapidly improving open-source ecosystem with GLM, Qwen, DeepSeek, and Gemma on the other. This split means there’s a model for almost any use case, whether you need deep enterprise reasoning or something fast and cheap to run locally.

The LLM landscape clearly shows closed models on one side, open-source models on the other.
The 80-20-0 Rule: Great, Decent, and Trash Models
There’s a clear split in how models perform and how often you should use them. I call it the 80-20-0 rule:

The 80-20-0 rule for model quality. Use great models most of the time.
Great Models (80% of the time):
These are the workhorses. The models you use when quality, reliability, and capability are non-negotiable. GPT-5 Thinking, Claude Sonnet 4, Qwen3 Coder, GLM 4.5, and Gemini 2.5 Pro are the leaders. They handle general intelligence, coding, and complex reasoning with consistency.
Decent Models (20% of the time):
These are good for specific tasks or when you need something more cost-effective. They perform well but don’t quite hit the frontier capabilities of the ‘Great’ models.
Trash Models (0% of the time):
Avoid these. They’re unreliable, prone to hallucination, or have fundamental limitations. We don’t even include them in our data because they’re not worth anyone’s time. Llama 4 Maverick and Scout, Llama 3.3 70B, Minimax M1, Devstral, Magistral Small, Nemotron, and now Qwen3 Max all fall into this bucket.
The Greats: Your Top 5 Commanders
These models are currently dominating the scene. They offer distinct advantages for different use cases, but all deliver top-tier performance.

The ‘trading cards’ for the top five great models.
1. GPT-5 Thinking (OpenAI)
- Strengths: World-class knowledge, highest innate intelligence, creative writing, top-tier code (though requires specialized prompting for robust agentic coding).
- Weaknesses: Often overkill for simpler tasks, can be slower, and tool calling is not as strong as Claude’s.
- Cost to run (AA Index task): $873 (high), $460 (medium).
- Context: 400K tokens.
- Best for: Situations demanding complex reasoning, deep analysis, or high-stakes creative generation.
2. Claude Sonnet 4 (Anthropic)
- Strengths: Exceptional at coding and for agentic workflows, superb writing and design, strong tool use. Claude has always been at the forefront of prompt engineering and structured output and it really shows in areas like browser AI safety.
- Weaknesses: Not ideal for heavy math and abstract reasoning, and it’s expensive.
- Cost to run (AA Index task): $665.
- Context: 200K tokens.
- Best for: Coding projects, complex agent orchestration, detailed design work, and tool-augmented workflows.
3. Qwen3 Coder (Alibaba)
- Strengths: Blazing fast, excels at coding, exceptional tool calling, strong in front-end design.
- Weaknesses: Limited creativity and reasoning beyond coding contexts, lacks vision.
- Cost to run (AA Index task): $80.
- Context: 256K tokens.
- Best for: High-speed coding tasks, structured data output, and tool-driven automation where vision isn’t needed.
4. GLM 4.5 (Z.ai)
- Strengths: Good across coding, writing, and design, performs well with agents, and is very cost-effective.
- Weaknesses: Weaker on heavy, complex reasoning, no vision capabilities.
- Cost to run (AA Index task): $232.
- Context: 128K tokens.
- Best for: Cost-effective agentic tasks, general creative output, and coding where the absolute highest reasoning isn’t required.
5. Gemini 2.5 Pro (Google)
- Strengths: Very intelligent overall, good at coding, strong in design and creative writing, offers a long context window. It also shows world-class performance for image generation.
- Weaknesses: Can be unreliable and difficult to steer, inconsistent with tool use.
- Cost to run (AA Index task): $1,012.
- Context: 1M tokens.
- Best for: Extensive long-context analysis, creative content generation, and multimodal applications.
The Decent Models: Filling the Gaps
The models in this category are perfectly capable for a variety of tasks, often offering a sweet spot between performance and cost. They might not be the absolute frontier models, but they deliver solid results when aligned with the right problem.

The ‘trading cards’ for the decent models.
Decent Models: Snapshot Comparison
A selection from the ‘decent’ category, balancing intelligence and cost.
| Model | Company | Intelligence Index | Cost to Run (Index Task) | Key Strength/Use |
|---|---|---|---|---|
| o3 | OpenAI | 65 | $407 | Reliable reasoning, good at tool use. |
| o4-mini (high) | OpenAI | 59 | $337 | Fast, cheap, reliable reasoning for agents. |
| GPT-5 Mini (high) | OpenAI | 62 | $170 | Cheap reasoning, reliable structured output, instruction following. |
| GPT-4.1 | OpenAI | 43 | $69 | Reliable tool calls, structured output, long context. |
| GPT-5 Nano (high) | OpenAI | 49 | $61 | Smart for its weight class, great for summarization, very cheap. |
| Grok 3 mini Reasoning (high) | xAI | 57 | $57 | Incredible price/performance, great for agents and context. |
| Kimi K2 | Moonshot | 46 | $78 | Coding, tools/agents, fast on Groq. |
The cost-to-run metric from the Artificial Analysis Index is a solid improvement over raw token pricing. It factors in how many tokens it takes to complete a task, giving a more realistic picture of operational expense. Still, the index’s high-volume nature can make models appear more expensive in raw benchmarking than in typical single-request use.
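To make that distinction concrete, here’s a minimal sketch of task-level costing. The token counts and prices are made-up assumptions; the point is that a verbose model can cost several times its sticker price per task.

```python
def task_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one task, given per-million-token prices."""
    return ((input_tokens / 1e6) * price_in_per_m
            + (output_tokens / 1e6) * price_out_per_m)

# Two models with identical pricing, but one "thinks" 12x longer.
# Token counts are illustrative, not measured.
concise = task_cost(4_000, 1_000, price_in_per_m=0.60, price_out_per_m=2.50)
verbose = task_cost(4_000, 12_000, price_in_per_m=0.60, price_out_per_m=2.50)
print(f"concise: ${concise:.4f}  verbose: ${verbose:.4f}")  # $0.0049 vs $0.0324
```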
Top 5 Most Expensive Models by Artificial Analysis Run Cost
The most expensive models to run based on the Artificial Analysis Index task, reflecting significant task completion costs.
Top 5 Least Expensive Models by Artificial Analysis Run Cost
The cheapest models to run based on the Artificial Analysis Index task, highlighting efficient options.
New Entrants and Shakers: Kimi K2.1 and Qwen3 Max
Kimi K2.1 (Moonshot) The New Contender
There’s a new player on the block: Kimi K2.1 (also known as Kimi K2 0905). This model is a noticeable improvement over its predecessor. It performs better across nearly every benchmark, especially in agentic tool calling, terminal use, and front-end code. It’s almost on par with Claude Sonnet 4 in those specific areas, but at a much lower cost.
- Performance: K2.1 shows significant gains in agentic tasks, tool calling, and front-end coding. My testing shows it handles front-end well, though it can struggle with the kind of complex issues Claude excels at.
- Speed: Its availability on Groq means you can use it at over 300 tokens per second, making it incredibly fast. This speed is a game-changer for many applications, aligning with the growing importance of real-time AI responsiveness.
- Context Window: A 256K context window is substantial, allowing for more complex and prolonged interactions.
- Pricing: On Groq, it costs $1 per million input tokens and $3 per million output tokens. Its general API pricing is $0.60 per million input and $2.50 per million output tokens (see the worked example after this list).
- Potential: This performance and cost profile could push Kimi K2.1 into the ‘Great Models’ category, potentially sitting alongside Qwen3 Coder for specific coding and agentic tasks. We need more community analysis to solidify this, but it’s a strong contender. Moonshot is really making a point with Kimi K2.1, showing that fast and cheap doesn’t have to mean dumb. Open-source models can often match closed-source capabilities, sometimes surpass them in unique ways, and usually at reduced cost. This dynamic is one of the main reasons I like open-source models: they drive down costs and foster innovation. For more on Kimi K2.1’s specifics, check out my full breakdown here: Moonshot’s Kimi K2.1: The New Fast, Cheap, and Surprisingly Capable Coding Agent Model.
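As a quick sanity check on those numbers, here’s the per-request arithmetic. The 8K-in / 2K-out request size is an assumption for illustration, not a measured workload.

```python
# Per-request cost for Kimi K2.1 at the prices quoted above.
# Request size (8K input, 2K output tokens) is an illustrative assumption.
tokens_in, tokens_out = 8_000, 2_000

groq_cost = tokens_in / 1e6 * 1.00 + tokens_out / 1e6 * 3.00   # $0.0140
api_cost  = tokens_in / 1e6 * 0.60 + tokens_out / 1e6 * 2.50   # $0.0098

print(f"Groq: ${groq_cost:.4f}  general API: ${api_cost:.4f}")
# At ~300 tokens/s on Groq, the 2K-token response streams in about 7 seconds.
```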
Qwen3 Max (Alibaba) The Trainwreck
Then there’s Qwen3 Max. This model is, to put it mildly, bizarre. It acts like it personally tested models, tries to ‘correct’ you on model names, and then refuses to accept corrections itself. It gets confused easily and goes completely off the rails, confidently hallucinating claims about LLMs that are simply not true. It sometimes even thinks it’s a human.
- Behavior: Highly erratic, hallucinates, and exhibits poor self-correction. It’s a mess.
- Coding: Terrible at code, front-end, and SVG generation.
- Pricing: $1.20 per million input tokens and $6 per million output tokens. Not cheap enough to justify its poor performance.
- Categorization: This unequivocally goes into the ‘Trash’ category. I cannot recommend it for anything. It’s a prime example of a model that fails miserably in practical application, despite what any internal benchmark might claim.
The AI & Marketing Angle: Strategic Model Selection
From a business perspective, the insights from the Artificial Analysis Index and detailed model breakdowns are gold. It’s not about finding the ‘smartest’ model, but the *right* model for the *right* task. Overkill models like GPT-5 Thinking might be wasted on simple content generation, while a specialized model like Qwen3 Coder could be perfect for high-volume, structured coding tasks. This directly applies to what I’ve seen in coherence and cost optimization in AI. Knowing your options, and their true costs and strengths, is key to building efficient AI systems.
Choosing the correct model can significantly impact both effectiveness and cost. For example, using a cheap and fast model for summarization like GPT-5 Nano might be much more efficient than hitting a frontier model that charges far more per token and might be slower. It all comes down to aligning the archetype of the model with the archetype of the task you’re trying to solve.
There’s a need for systems that can intelligently route requests to the best-fit model, whether that’s a small, cheap model for data extraction or a frontier model for high-stakes decision-making. No one model is a magic bullet; the true power is in orchestration. This is something I’ve been saying for a while: as the model options grow, so does the complexity of choosing the right one. My approach has been to build intelligent automation systems that handle this routing, ensuring optimal performance and cost-efficiency.
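Here’s a minimal sketch of what such a router can look like. The model names come from this article; the task taxonomy, routing table, and fallback choice are my own illustrative assumptions, not a definitive policy.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    reason: str

# Illustrative routing table built from this article's recommendations.
# Task categories and the fallback are assumptions, not a definitive policy.
ROUTING_TABLE = {
    "complex_reasoning": Route("gpt-5-thinking", "highest innate intelligence"),
    "agentic_coding":    Route("claude-sonnet-4", "best agent workflows and tool use"),
    "bulk_coding":       Route("qwen3-coder", "fast, cheap, strong tool calling"),
    "summarization":     Route("gpt-5-nano", "smart for its weight class, very cheap"),
    "data_extraction":   Route("gpt-4.1", "reliable structured output"),
}

def route(task_type: str) -> Route:
    """Pick a model for a task; fall back to a cost-effective generalist."""
    return ROUTING_TABLE.get(task_type, Route("glm-4.5", "cost-effective generalist"))

print(route("summarization"))    # -> gpt-5-nano
print(route("legal_analysis"))   # unknown task type -> falls back to glm-4.5
```

In production you’d route on classifier output or request metadata rather than a hand-labeled task type, but the shape of the decision is the same.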
Looking Deeper: Beyond the Index Score
While the Artificial Analysis Intelligence Index v3 provides an invaluable baseline, it’s crucial to remember that it’s a specific set of tests. Real-world applications often demand capabilities that a general index might not fully capture. For instance, a model with a lower overall intelligence score might still be superior for specific tasks due to specialized training or architectural advantages. Claude Sonnet 4, despite its mid-range index score, is a prime example of a model that excels in coding and agentic tasks, areas where raw ‘intelligence’ as measured by broad benchmarks doesn’t always translate directly to practical performance.
Consider the nuances:
- Specialized Strengths: Some models are fine-tuned for particular domains, such as coding, legal analysis, or creative writing. Their performance in these areas can far outstrip a generalist model, even if the generalist has a higher overall ‘intelligence’ score.
- Architectural Efficiency: Factors like model architecture, inference speed, and memory footprint play a huge role in practical deployment. A smaller, faster model might be preferred for high-volume, low-latency applications, even if it’s ‘less intelligent’ than a larger, slower counterpart.
- Tool Integration: The ability to interact seamlessly with external tools and APIs is becoming increasingly critical for autonomous agents. Models with strong tool-calling capabilities can achieve complex tasks that a purely ‘intelligent’ model might struggle with. This is where models like Claude Sonnet 4 and the new Kimi K2.1 really shine; a minimal tool-definition example follows this list.
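For reference, this is one common wire format for declaring a tool to a model: an OpenAI-style JSON Schema definition (Anthropic and others use similar but not identical shapes). The weather tool itself is a made-up example.

```python
# A hypothetical tool declared in OpenAI-style JSON Schema format.
# Other providers (e.g., Anthropic) use similar but not identical shapes.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# A strong tool-calling model reliably answers with a structured call like
#   {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}
# instead of free-form prose, which is exactly what agentic benchmarks stress.
```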
This is why a holistic view is necessary. The Intelligence Index gives us a valuable data point, but it’s just one piece of the puzzle. Developers and businesses need to combine benchmark data with detailed model analysis, real-world testing, and a clear understanding of their specific use case to make informed decisions. It’s about optimizing for the problem at hand, not just chasing the highest number on a leaderboard. For example, my own systems for content generation prioritize coherence and cost optimization, which means the ‘smartest’ model isn’t always the best fit. I’ve covered this extensively in my piece on AI’s New Frontier: Coherence Over Raw Intelligence and The Cost Paradox.
The Economic Reality: Cost vs. Intelligence
The ‘Cost to Run’ metric introduced by Artificial Analysis is a welcome addition, moving beyond simple token pricing to reflect the actual expense of completing a task. It accounts for the model’s efficiency: how many tokens it needs to generate a satisfactory response. However, high-volume benchmarking can still skew these numbers. A model might appear expensive in a benchmark setting because it’s completing hundreds or thousands of complex, multi-turn tasks, yet be perfectly cost-effective for a single, well-defined API call.
The interplay between a model’s intelligence and its operational cost is a constant balancing act. Frontier models like GPT-5 Thinking and Claude Opus 4.1 offer unparalleled capabilities, but they come at a significant price. For many applications, a ‘decent’ or even ‘minimal’ model might offer 90% of the required performance at a fraction of the cost. This is particularly true for tasks like summarization, basic data extraction, or simple instruction following, where over-provisioning with a top-tier model leads to unnecessary expense.
Consider the economics:
- Return on Intelligence: Does the incremental gain in intelligence from a more expensive model justify its increased cost for your specific use case? For a highly sensitive financial analysis, perhaps. For generating a social media caption, probably not.
- Scalability Costs: When deploying AI at scale, even small differences in per-token or per-task costs can add up to substantial operational expenses. This is where efficient, cheaper models shine, especially open-source options that can be optimized for specific hardware.
- Infrastructure: Running models locally, particularly open-source ones, can bypass API costs entirely, though it introduces infrastructure and maintenance overhead. This is a trade-off many businesses are considering, especially with the rise of hardware optimized for local inference.
The goal is to find the ‘sweet spot’ where you get sufficient intelligence for your task without overspending. This often means keeping a portfolio of models, ready to be deployed based on the specific requirements of each request. That is the core principle behind building intelligent automation systems: dynamically routing tasks to the most appropriate model to optimize for both performance and cost. It’s not about being cheap; it’s about being smart with your resources, and that’s something I advocate for consistently. For more on the economics of AI, you might find my thoughts on Anthropic’s Valuation and AI Pricing insightful.
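Some back-of-the-envelope math shows why this matters at scale. The per-task costs and request volume below are hypothetical placeholders, not measured prices.

```python
# Hypothetical scaling math; per-task costs are placeholders, not quotes.
def monthly_cost(tasks_per_day: int, cost_per_task: float) -> float:
    """Rough monthly spend, assuming a 30-day month."""
    return tasks_per_day * 30 * cost_per_task

frontier = monthly_cost(50_000, cost_per_task=0.040)  # $60,000 / month
small    = monthly_cost(50_000, cost_per_task=0.003)  # $4,500 / month
print(f"frontier: ${frontier:,.0f}  small model: ${small:,.0f}")

# If the small model delivers ~90% of the needed quality on summarization,
# the ~13x saving usually wins; route only the hard 10% to the frontier.
```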
The Future: Open-Source vs. Closed-Source Dynamics
The split between closed and open-source models is not just a philosophical debate; it has direct implications for innovation, cost, and accessibility. Closed models from major labs often push the frontier of raw intelligence, benefiting from massive proprietary datasets and computational resources. However, open-source models, even if a few months behind in raw capability, often innovate faster in terms of deployment, customization, and community-driven applications. The rapid iteration of models like Kimi K2.1 and GLM 4.5 is a testament to this.
Open-source models offer:
- Transparency: The ability to inspect and understand the model’s internal workings, which can be crucial for debugging, auditing, and ensuring ethical AI deployment.
- Customization: The freedom to fine-tune models for specific industries or proprietary datasets without vendor lock-in.
- Cost Efficiency: The potential to run models on owned infrastructure, reducing API costs and offering greater control over operational expenses, especially when paired with specialized hardware like Groq or Cerebras.
- Innovation: A broader community of developers experimenting with and building upon base models, leading to unexpected applications and optimizations.
While proprietary models might occasionally ‘leapfrog’ open-source offerings, the open-source ecosystem consistently drives down costs and fosters innovation across the board. The ability for closed-source labs to ‘borrow’ and improve upon open-source advancements also creates a unique, albeit sometimes one-sided, feedback loop that benefits the entire field. As I’ve said before, this back-and-forth ensures that neither side can rest on its laurels, ultimately accelerating progress.
Conclusion
September 2nd, 2025, finds the LLM landscape dynamic and stratified. The Artificial Analysis Intelligence Index v3 offers a robust new benchmark, but its scores require context. The 80-20-0 rule simplifies model selection, highlighting the ‘Greats’ that drive most value. Emerging models like Kimi K2.1 illustrate continuous innovation, pushing the boundaries of what’s possible with speed and cost, while failures like Qwen3 Max remind us that not every new release is a step forward.
Ultimately, choosing the right LLM means looking beyond a single intelligence score. It requires a nuanced understanding of a model’s strengths, weaknesses, archetypes, and real operational costs. For businesses and developers, it means carefully matching the tool to the task for optimal outcomes, rather than chasing the highest number on a leaderboard. The balance between raw intelligence, specialized capabilities, and economic viability is the true measure of a model’s utility in today’s complex AI deployments.