Created using Ideogram 2.0 Turbo with the prompt, "Cinematic photos of a comparison table on AI model performance, featuring high photorealism, close-up shots of the table, and a soft focus background highlighting AI technology."

BREAKING: NEW Claude 3.5 Sonnet

Introduction

The launch of the “Claude 3.5 Sonnet (new)” model by Anthropic has prompted a closer look at its performance against other leading AI models: the original Claude 3.5 Sonnet, Claude 3.5 Haiku, GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash. This analysis highlights where each model excels and where it falls short across a range of benchmark tasks.

Performance Overview

Graduate Level Reasoning

The new Claude 3.5 Sonnet leads graduate-level reasoning (GPQA Diamond) with a score of 65.0%, well ahead of Claude 3.5 Haiku at 41.6%. Its margin over the original Sonnet and Gemini 1.5 Pro, however, is narrow: the improvements are real, but the competition remains fierce.

Undergraduate Level Knowledge

With a score of 78.0% on undergraduate-level knowledge (MMLU Pro), the new Claude 3.5 Sonnet also shines, slightly surpassing Gemini 1.5 Pro. The model looks well suited to handling academic-level questions.

Code Proficiency

In coding tasks (HumanEval), the new Claude 3.5 Sonnet excels with a remarkable 93.7%. This is not only the best score among the models compared; it also marks the model as a go-to resource for developers.
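If you want to probe the model's coding ability yourself, here is a minimal sketch using the Anthropic Python SDK. The model ID claude-3-5-sonnet-20241022 corresponds to the October 2024 release, but check Anthropic's documentation for the current identifier; the prompt is just an example.

```python
# Minimal sketch: ask Claude 3.5 Sonnet (new) for code via the Messages API.
# Assumes ANTHROPIC_API_KEY is set in the environment.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY automatically

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # ID of Claude 3.5 Sonnet (new)
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that merges two sorted lists.",
        }
    ],
)

# The reply arrives as a list of content blocks; the first is the text answer.
print(response.content[0].text)
```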

Math Problem-Solving

The new Sonnet performs well at 78.3% on math problem-solving (MATH), but it still trails Gemini 1.5 Pro at 86.5%. For mathematical challenges, Gemini holds a clear edge.

High School Math Competition

On competition-level math (AIME 2024), the new Sonnet scores a modest 16.0%, though it still outperforms both the Haiku and the original Sonnet. Clearly, more work is needed in this area.

Visual Q/A

Here, the new model stands strong with 70.4% accuracy (MMMU), closely followed by GPT-4o. This result points to solid multimodal capabilities, making it a versatile option for visual tasks; a sketch of a visual Q/A request follows below.
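As a rough illustration of what visual Q/A means in practice, this sketch sends an image to the model as a base64-encoded content block alongside a question. The file name chart.png and the question are hypothetical placeholders.

```python
# Minimal sketch of a visual Q/A request: image plus question in one message.
import base64

import anthropic

client = anthropic.Anthropic()

with open("chart.png", "rb") as f:  # hypothetical local image file
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "What trend does this chart show?"},
            ],
        }
    ],
)

print(response.content[0].text)
```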

Agentic Coding and Tool Use

Performance in agentic tasks reveals room for improvement. The new Sonnet scores 49.0% on agentic coding (SWE-bench Verified) and 69.2% on tool use (TAU-bench), signaling that while the potential is there, the model still has gaps to fill. The sketch below shows what tool use looks like at the API level.
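For context on what the tool-use benchmark exercises, here is a minimal sketch of the Anthropic tool-use flow: the model is offered a single tool and decides whether to call it. The get_weather tool is a hypothetical example, not a real API; a production agent would execute the tool and return the result in a follow-up message.

```python
# Minimal sketch of tool use: offer one tool, inspect the model's tool call.
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    }
]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
)

# If the model chose to call the tool, the response contains a tool_use block
# holding the arguments it filled in. A real agent loop would run the tool and
# send the result back as a tool_result message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```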

Conclusion

The Claude 3.5 Sonnet (new) shows significant promise across a range of domains, particularly coding and undergraduate-level knowledge. Competition remains strong, however, especially from the Gemini models in math problem-solving. As AI continues to evolve, these performance metrics will help users select the right model for their specific tasks.

Meanwhile, the anticipated release of Claude 3.5 Opus has been delayed, but I am confident it will be an impressive addition to the lineup, keeping Anthropic at the forefront of AI advancements. For further insights on AI advancements, check out my previous posts on DALL-E 3, HAIPER 2.0, and Mistral AI.