Claude Opus 4.5 is the first time Anthropic’s top-tier model actually looks practical for day-to-day work instead of just special cases. Pricing dropped to $5 per million input tokens and $25 per million output tokens, and, more importantly, the model uses far fewer tokens than Sonnet 4.5 to reach the same or better result. In a lot of real workflows, Opus 4.5 will quietly be cheaper than Sonnet while giving you frontier-level reasoning.
What actually changed with Claude Opus 4.5
Anthropic’s own write-up and early customer feedback line up around a few concrete shifts:
- Pricing is finally sane for Opus. $5 in / $25 out per million tokens moves it out of “only for the hardest bugs” territory and into “default for serious work” territory.
- Token efficiency is the real headline. Opus 4.5 reaches higher scores than Sonnet 4.5 on tough coding and agent benchmarks while using fewer tokens to get there.
- Effort control gives you a real cost vs depth dial. The new effort parameter on the Claude API lets you decide whether you want Opus to think lightly and fast or to grind harder on a problem.
- Context and tools are built for agents. 200K context, context compaction, memory, and advanced tool use are tuned around long-horizon, multi-step workflows rather than one-shot answers.
- Safety and prompt injection robustness stepped up. On Gray Swan’s strong injection tests, Opus 4.5 is harder to trick than other frontier models, which matters once you connect it to real tools and data.
If you skipped earlier Opus generations because invoices hurt, this is the first one that really deserves a fresh cost analysis.
Token efficiency is the real story
Anthropic’s own benchmarks plus partner feedback keep repeating the same thing: Opus 4.5 solves harder tasks with fewer tokens and fewer dead-ends.
- On SWE-bench Verified, at medium effort, Opus 4.5 matches Sonnet 4.5’s best score while using 76% fewer output tokens.
- At high effort, it beats Sonnet 4.5 by 4.3 points and still uses 48% fewer output tokens.
- On long-horizon coding tasks, one partner reports up to 65% fewer tokens with higher pass rates.
- Another team saw 50% to 75% drops in tool calling errors and build or lint errors, which means fewer wasted iterations and less junk in your traces.
This lines up with a theme I talked about in my TOON vs JSON post: at this stage, token efficiency is not just about prompt format tricks. It is about how much wandering the model does internally to reach a good answer. Opus 4.5 seems to do far less wandering.
When you combine a cheaper per-token price with needing far fewer tokens, Opus 4.5 ends up in a strange but very welcome spot: for serious work, it can be the cost control option while still being the strongest Anthropic model.
Cost math: when Opus 4.5 is cheaper than Sonnet 4.5
The interesting part is how often Opus 4.5 wins on total cost, not just on quality. Take a simple mental model: assume Sonnet 4.5 uses 100 units of output tokens on a given coding task and Opus 4.5 uses 24 units at medium effort (the 76% reduction from the SWE-bench numbers above). Even if they were priced the same, that is a huge gap.
Now layer in Anthropic’s new pricing. Opus is no longer the extreme luxury tier. You get higher pass rates and fewer retries while also shrinking the trace size. The partners reporting 50% to 75% cuts in tool call and build or lint errors are describing exactly that effect: fewer failed attempts means fewer prompts and fewer wasted tool invocations.
This is the same pattern I care about in system design work: you do not just look at raw token price or benchmark scores in isolation. You care about how many prompts and tool calls it takes to actually finish a workflow, multiplied by the tokens each one burns. Opus 4.5 pushes every factor in that product in the right direction.
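The back-of-the-envelope version of that math fits in a few lines. The Sonnet 4.5 prices here ($3 in / $15 out per million) and the concrete token counts are illustrative assumptions, with the 76% output-token reduction taken from the medium-effort SWE-bench figure above:

```python
# Back-of-the-envelope cost comparison. Token counts are hypothetical;
# per-million-token prices follow Anthropic's published rates at launch.

def task_cost(input_tokens: int, output_tokens: int,
              in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one task at the given per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# Assume both models read the same 20K-token context, the task would take
# Sonnet 4.5 100K output tokens, and Opus 4.5 at medium effort uses 76% fewer.
sonnet = task_cost(20_000, 100_000, in_price_per_m=3.0, out_price_per_m=15.0)
opus = task_cost(20_000, 24_000, in_price_per_m=5.0, out_price_per_m=25.0)

print(f"Sonnet 4.5: ${sonnet:.2f}")  # → Sonnet 4.5: $1.56
print(f"Opus 4.5:   ${opus:.2f}")    # → Opus 4.5:   $0.70
```

Under these assumptions the nominally pricier model comes out less than half the cost per finished task, before you even count fewer retries.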
Effort control: you finally get a thinking dial
The new effort parameter is simple but important. Instead of swapping models to get cheaper or deeper runs, you can tell the same model how hard to think:
- Low effort – quick answers, minimal chain-of-thought, aimed at chatty UX or simple transformations where you care more about latency and cost than squeezing the last bit of quality.
- Medium effort – the default that already matches Sonnet 4.5 on tough coding benchmarks while saving most of the tokens.
- High effort – longer reasoning traces, more branches explored, better for scary refactors or multi-agent orchestration where fixing a mistake is expensive.
That “effort as a knob” story matches what partners are seeing. One team with a SQL-heavy workflow described Opus 4.5 as feeling dynamic rather than prone to overthinking: at lower effort it still meets their quality bar, with much less waste.
If you are building your own agents, this is also a cleaner interface than trying to wire in separate “fast” and “slow” models. You can keep the same mental model of how Claude thinks and just tune the depth.
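As a sketch of what that knob looks like in an agent, here is a hypothetical routing helper. The `choose_effort` function and its inputs are my invention, and the exact field name and accepted values on the Claude API may differ from this low/medium/high framing, so treat the request dict as a shape, not the real payload:

```python
# Hypothetical helper: pick an effort level based on how expensive a
# mistake would be and how much latency matters. Check Anthropic's API
# docs for the actual parameter name and accepted values.

def choose_effort(blast_radius: int, latency_sensitive: bool) -> str:
    """Map task traits to an effort level.

    blast_radius: rough count of files or systems a mistake could touch.
    """
    if latency_sensitive and blast_radius <= 1:
        return "low"       # chatty UX, simple transformations
    if blast_radius >= 10:
        return "high"      # scary refactors, multi-agent orchestration
    return "medium"        # default: Sonnet-level quality, fewer tokens

# Illustrative request shape; "claude-opus-4-5" is a placeholder model id.
request = {
    "model": "claude-opus-4-5",
    "effort": choose_effort(blast_radius=2, latency_sensitive=False),
}
print(request["effort"])  # → medium
```

The point is that depth becomes a per-request decision in your own code, instead of a deploy-time choice between two different models.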
Agents, coding, and long-horizon work
Anthropic is very clearly steering Opus 4.5 toward agentic workloads: coordinated tools, long-running plans, multi-agent setups. A few concrete data points from partners and Anthropic’s own testing:
- Long-horizon coding – better pass rates with up to 65% fewer tokens on held-out tests, and fewer dead-ends on Terminal Bench with a 15% jump over Sonnet 4.5.
- Office and enterprise automation – self-improving agents for office workflows reached peak performance in four iterations with Opus 4.5, while other models could not hit the same level even after ten.
- Excel and financial modeling – internal evals saw 20% higher accuracy and 15% better efficiency, which is exactly where errors and retries get expensive.
- Code review and refactors – reviewers report more issues caught without extra noise, and successful refactors spanning multiple codebases and agents.
This is very aligned with what I argued in my GPT-5.1-Codex-Max xhigh write-up: the interesting part is not “can it write code” anymore. The interesting part is whether it can plan, call tools, and adapt across a 30 minute or longer run without falling into a ditch. Opus 4.5 looks tuned for exactly that sort of work.
The multi-agent angle matters too. Opus 4.5 is particularly good at supervising teams of sub-agents, helped by Anthropic’s context management and memory tools. In deep research evals that used context compaction and memory, they report roughly a 15-point jump in performance compared to not using those tools at all.
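To make the compaction idea concrete, here is a minimal client-side sketch: keep the most recent turns verbatim and collapse everything older into a summary slot. The `summarize` stub stands in for a model call, and Anthropic’s actual context compaction is their own, certainly more sophisticated, implementation:

```python
# Minimal sketch of context compaction: recent turns stay verbatim,
# older turns collapse into one summary entry.

def summarize(turns: list[str]) -> str:
    # Stand-in: a real implementation would call a model here.
    return f"[summary of {len(turns)} earlier turns]"

def compact(history: list[str], keep_recent: int = 4) -> list[str]:
    """Return a compacted history: one summary line plus the recent turns."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = [f"turn {i}" for i in range(10)]
print(compact(history))
# → ['[summary of 6 earlier turns]', 'turn 6', 'turn 7', 'turn 8', 'turn 9']
```

Even this toy version shows why the technique pays off on long-horizon runs: the context stops growing linearly with the number of turns.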
Safety and prompt injection robustness
As these models start driving tooling and touching real systems, safety is not an abstract research topic anymore. You actually care whether your agent can be tricked by a web page or a document into doing something dumb.
Anthropic’s internal evals point to lower “concerning behavior” scores for Opus 4.5 than any of their earlier models, plus industry-leading robustness on Gray Swan’s strong prompt injection tests. That means it is harder to convince Opus 4.5 to ignore system instructions when you pipe untrusted content into its context.
This connects to a thread I pulled on in my SEAL post about self-adapting models: once you let models run long, tune themselves, or operate as agents, the guardrails and evals matter more than the novelty of the technique. Anthropic is at least showing their work here with a detailed system card.
Product surface: Claude Code, Chrome, Excel, desktop
On the product side, Anthropic is quietly fixing one of the big issues I called out in my Google Gemini piece: having a strong model does not help much if the product surface is fractured.
With Opus 4.5, they updated:
- Claude Code – Plan Mode now asks clarifying questions up front, writes a plan.md you can edit, then executes. It is also available in the desktop app, so you can run multiple local and remote sessions at once.
- Claude app – long chats no longer slam into a hard context wall, because the app auto-summarizes older turns as needed.
- Claude for Chrome and Excel – Chrome automation is available to all Max users, and Excel integration is opened up more broadly to Max, Team, and Enterprise users, which pairs well with the stronger spreadsheet performance.
This is not flashy marketing, but it is the boring kind of product work that actually makes a strong model usable.
Where Opus 4.5 actually fits
Opus 4.5 does not flip the table on everything. It is a new, better Claude that happens to hit an important point on the curve: high-end reasoning, strong agent skills, and enough token efficiency that you no longer have to treat it as a rare luxury.
If you are already on Sonnet 4.5, I would think about it this way:
- Use Opus 4.5 for large refactors, multi-agent systems, production code review, serious Excel or financial modeling, and long research or planning threads where retries are expensive.
- Keep Sonnet or smaller models for quick chats, throwaway scripts, and high-volume low-risk tasks where raw cost and latency still matter more than squeezing out extra reasoning depth.
- Use open-source models when you need privacy, custom hosting, or ultra-low cost, like I have argued before in my pieces on models like OLMo 3 32B Think and HunyuanVideo 1.5.
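If you want that split encoded rather than remembered, a tiny router does the job. The model ids and routing criteria below are placeholders for illustration, not Anthropic’s naming:

```python
# Hypothetical model router for the three-way split described above.
# Substitute whatever ids your stack actually deploys.

def pick_model(task_kind: str, needs_privacy: bool, high_volume: bool) -> str:
    if needs_privacy:
        return "local-open-model"    # self-hosted, e.g. an OLMo-class model
    if high_volume or task_kind in {"chat", "throwaway-script"}:
        return "claude-sonnet-4-5"   # cheap, fast, low-risk work
    return "claude-opus-4-5"         # refactors, agents, serious modeling

print(pick_model("refactor", needs_privacy=False, high_volume=False))
# → claude-opus-4-5
```

The notable change is which branch is now the default: heavy coding and automation fall through to Opus instead of being special-cased up to it.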
The key point: Claude Opus 4.5 finally makes sense as a default choice for heavy coding and automation, not just a model you pull out on special occasions. The model is stronger, yes, but the bigger shift is that the token economics and effort control finally line up with how people actually work.