I had to make my AI benchmark harder because every top model was hitting 100%. Claude 4 Sonnet, Claude 4 Opus, and Gemini 2.5 Pro had all maxed out the score. When your benchmark gets saturated, it’s time to raise the bar.
The results from the new version are clear: Claude 4 Opus is in a completely different league. Not just marginally better. Qualitatively different in how it approaches complex problems.
But here’s what actually surprised me. During my Make.com module generation test – one of the hardest challenges where the AI needs to research documentation and build working modules – Claude 4 Opus didn’t just succeed. It taught me something about Make.com’s API that I didn’t know existed.
The Discovery That Changed My Workflow
Most Fal.ai requests require constant polling – you send a request, it goes into a queue, and you have to keep checking until it’s complete. In Make.com, this means either guessing timing or setting up complex fallbacks that burn through operations. Claude 4 Opus found a synchronous endpoint in their API that handles the entire request-wait-response cycle in one module. I had no idea this existed.
That’s the difference between a model that follows instructions and one that actually understands what you’re trying to accomplish. Claude 4 Opus looked at the problem, researched the documentation, and found a better solution than what I was using in my own automations.
The Overeagerness Problem: Finally Fixed
Claude 3.7 Sonnet was maddening. Ask for a small CSS alignment fix, and it would restructure your entire database. Request a minor bug fix, and it would rewrite half your codebase. The overeagerness was out of control.
Claude 3.5 Sonnet had this issue too, but 3.7 was that problem on steroids. You couldn’t trust it with targeted changes because it would always go nuclear on your code.
Anthropic claims they’ve fixed this with Claude Sonnet 4, and from my testing, they’re right. The model makes the changes you actually ask for without the unnecessary architectural overhauls.
Gemini vs Claude: UI Polish vs Complex Logic
My benchmark revealed an interesting split. Gemini 2.5 Pro consistently delivered cleaner-looking user interfaces, but Claude 4 Opus dominated complex logic and tool use scenarios.
For front-end work where visual polish matters – HTML, CSS, JavaScript with Tailwind CDN – Gemini often produced better-looking results. The layouts were cleaner, the styling more polished.
But when the task required deep reasoning, complex integrations, or sophisticated tool use, Claude 4 Opus left everything else behind. It’s not even close.
This makes sense. Gemini seems optimized for producing visually appealing outputs. Claude 4 Opus appears optimized for actually solving complex problems correctly.
Why the Make.com Test Actually Matters
Most AI benchmarks test toy problems. My Make.com module generation challenge tests something that matters: can the AI research unfamiliar documentation, understand a complex API, and build something that actually works in production?
The task requires the model to:
- Research Make.com’s module structure requirements
- Find and understand third-party API documentation
- Build a valid module that integrates both systems
- Handle authentication, error cases, and data mapping
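For a sense of what “build a valid module” actually entails: Make.com custom-app modules are defined declaratively, with a communication object describing the HTTP call and how the response maps into the module’s output bundle. The sketch below (expressed as a Python dict for readability) reflects the general shape of that format as I understand it; the endpoint URL, parameter names, and field mappings are purely illustrative.

```python
# Rough sketch of a Make.com module "communication" definition.
# Field names follow my understanding of Make's custom-app JSON format;
# the URL and response mappings are hypothetical placeholders.
module_communication = {
    "url": "https://fal.run/{{parameters.model}}",  # illustrative sync endpoint
    "method": "POST",
    "headers": {
        # Connection credentials are referenced via template variables.
        "Authorization": "Key {{connection.apiKey}}",
    },
    "body": {
        "prompt": "{{parameters.prompt}}",
    },
    "response": {
        # Map API response fields into the module's output bundle.
        "output": {
            "imageUrl": "{{body.images.0.url}}",
        },
        # Surface API errors to the scenario instead of failing silently.
        "error": {
            "message": "{{body.detail}}",
        },
    },
}

print(sorted(module_communication))
```

Getting every layer of this right, from authentication through error mapping, with no feedback loop other than the documentation, is exactly what makes the test hard.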
Claude 4 Opus was the only model that consistently nailed this. But more importantly, it often found better approaches than the obvious ones.
The Fal.ai synchronous endpoint discovery is just one example. While other models would implement the polling pattern I expected, Claude 4 Opus dug deeper and found a more elegant solution.
What This Means for Business Automation
The jump from Claude 3.7 Sonnet to Claude 4 Opus represents a fundamental shift in what’s possible with AI automation. Previous models required heavy hand-holding. You’d write detailed prompts, provide extensive examples, and still need to review and fix their output constantly.
Claude 4 Opus changes that equation. You can give it high-level goals and trust it to figure out the implementation details. It researches what it needs to know, finds the best approaches available, and delivers working solutions.
For automation work specifically, this matters enormously. When building content automation systems or workflow integrations, you need an AI that can handle the unexpected edge cases and API quirks that come up constantly. Claude 4 Opus actually can.
The Naming Confusion (Again)
Anthropic is switching from Claude {Model Number} {Tier} to Claude {Tier} {Model Number}. So instead of Claude 4 Sonnet, it’s now Claude Sonnet 4. The change isn’t consistent across their marketing materials, which creates confusion.
This arbitrary naming change is typical for AI companies. OpenAI does it constantly. Google does it. They seem to enjoy making things unnecessarily complicated for developers who just want stable, predictable model names.
But whatever they call it, Claude 4 Opus is exceptional.
The Bottom Line
After testing all these models extensively on practical, real-world tasks, Claude 4 Opus stands alone at the top. It’s not just incrementally better – it’s qualitatively different in how it approaches complex problems.
For automation work, content generation, and sophisticated integrations, this is the model to use. The combination of advanced reasoning, excellent tool use, and the ability to actually learn from documentation makes it uniquely valuable.
Gemini 2.5 Pro remains strong for UI work. Claude Sonnet 4 is solid for general use cases and has fixed the overeagerness problem. But when you need the best possible results on complex tasks, Claude 4 Opus is the clear choice.
The future of AI-powered business automation just got a lot more interesting.