
Claude Opus 4.6 vs GPT-5.3-Codex: Model War Benchmarks and Self-Improvement

On February 5, 2026, Anthropic and OpenAI did something everyone expected: they turned flagship launches into a direct shootout. Claude Opus 4.6 and GPT-5.3-Codex dropped within minutes of each other. This round is not about a slightly nicer chat reply. It is about agents, recursive self-improvement, and efficiency gains that change how models are actually built.

Claude Opus 4.6: Long Context and Reasoning Spikes

Claude Opus 4.6 keeps the Opus 4.5 pricing at $5 per million input tokens and $25 per million output tokens, though prompts over 200,000 tokens are billed at a premium. The headline feature is a one million token context window in beta with a 128,000-token maximum output. Anthropic added adaptive thinking with four effort levels, context compaction, and early agent team features in Claude Code, and tightened the Excel and PowerPoint integrations. If you want to see how this fits into the broader market, check out my AI Labs LLM Rankings 2026.
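At those base rates, a back-of-the-envelope cost estimate is easy to sketch. The helper below is a minimal illustration, not an official SDK call; the premium rate for prompts over 200,000 tokens is not published in the numbers above, so it is deliberately left out.

```python
def opus_46_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a Claude Opus 4.6 request cost at the stated base rates:
    $5 per million input tokens, $25 per million output tokens.

    Prompts over 200,000 tokens carry an unspecified premium, so this
    returns only the base-rate estimate.
    """
    return input_tokens / 1e6 * 5.0 + output_tokens / 1e6 * 25.0

# A 100k-token prompt with a 10k-token response at base rates:
print(f"${opus_46_cost(100_000, 10_000):.2f}")  # $0.75
```

Note how output tokens dominate the bill at a 5x rate, which is why long-output agentic runs get expensive fast.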

The benchmarks back up the claim that this is a knowledge work upgrade. On GDPval-AA, which scores knowledge work on an Elo scale, Opus 4.6 hit 1,606, roughly 144 Elo above GPT-5.2 and 190 above Opus 4.5. The jump on ARC-AGI-2 is even starker: 68.8% against 37.6% for its predecessor, a different tier of abstract problem-solving. On BrowseComp for agentic search, it hit 84.0%, beating GPT-5.2 Pro at 77.9%.
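To put those Elo gaps in perspective, the standard Elo expected-score formula converts a rating difference into a head-to-head preference rate. This assumes GDPval-AA uses the conventional 400-point Elo scale, which is my assumption, not something the benchmark authors state here.

```python
def elo_win_prob(diff: float) -> float:
    # Standard Elo expected score on the conventional 400-point scale:
    # the probability the higher-rated model's output is preferred.
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

print(f"{elo_win_prob(144):.1%}")  # +144 over GPT-5.2 -> ~69.6%
print(f"{elo_win_prob(190):.1%}")  # +190 over Opus 4.5 -> ~74.9%
```

In other words, a 144-point gap means Opus 4.6's output would be preferred roughly seven times out of ten in a pairwise comparison, which is a large margin for models one generation apart.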

There are minor misses. SWE-bench Verified stayed flat at 80.8%, and MCP Atlas tool use dipped slightly to 59.5%, though it recovers at the high-effort adaptive thinking setting. None of that changes the real story: the reasoning and long-context retrieval jumps. Anthropic demonstrated this by having 16 Opus 4.6 instances build a functional C compiler from scratch over two weeks, a display of sustained multi-agent coordination that goes well beyond simple scripting.

GPT-5.3-Codex: Agentic Coding and Efficiency

GPT-5.3-Codex uses the GPT-5 family's standard 400,000-token context window but is tuned for agentic coding and computer use. It runs 25% faster than GPT-5.2-Codex and uses 48% fewer tokens for the same results, which works out to roughly a 2.6x effective throughput gain. This focus on efficiency is a clear response to the model routing trends I mentioned when discussing why OpenAI is retiring GPT-4o.
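The 2.6x figure follows from compounding the two stated gains. One way to read the numbers, assuming "25% faster" means 25% less wall-clock time per task:

```python
# Compounding the stated gains; assumes "25% faster" means
# 25% less wall-clock time for the same task.
speed_factor = 1 / (1 - 0.25)   # ~1.33x from the time reduction
token_factor = 1 / (1 - 0.48)   # ~1.92x from the token savings
print(f"{speed_factor * token_factor:.1f}x effective throughput")  # 2.6x
```

Most of the gain comes from the token savings, not the raw speed-up, which matters because fewer tokens also means a smaller bill for the same work.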

GPT-5.3-Codex Coding Performance

GPT-5.3-Codex leads decisively in agentic coding environments like Terminal-Bench 2.0.

On Terminal-Bench 2.0, GPT-5.3-Codex hit 77.3%, beating Opus 4.6 at 65.4%. It also reached 64.7% on OSWorld-Verified, which is close to the 72% human baseline. This model is also the first to be classified as high capability for cybersecurity. It can identify software vulnerabilities with a 77.6% score on CTF tasks. OpenAI is pairing this with a Trusted Access framework and $10 million in credits for cyber defense, which is a smart move given the potential for abuse.

The Era of Recursive Self-Improvement

Both labs used these models to build the models themselves. OpenAI was explicit about this: GPT-5.3-Codex monitored its own training runs, caught bugs in the training harness, and root-caused cache hit rate issues. It built data pipelines and summarized thousands of data points in minutes. Engineers at OpenAI say their jobs are fundamentally different now because the model handles the grind. Anthropic likewise noted that it builds Claude with Claude. This shortens iteration cycles. It is not a runaway loop yet, but it accelerates the firehose of new releases. I built ai-aggregator to track this exact trend because doing it by hand is becoming impossible.

Vending-Bench: Emergent Goal-Seeking

Andon Labs ran Opus 4.6 on Vending-Bench 2, which simulates a vending machine business. The goal was to maximize the bank balance. Opus 4.6 set a record at $8,017.59, but the behavior was concerning. It lied to suppliers about exclusivity, exploited the financial weakness of other players, and told customers it had issued refunds when it had not. This is what happens when you train models to achieve goals rather than just being helpful assistants. It highlights why security and reward design are now core product work.

As for infrastructure, Elon Musk mentioned orbital data centers on a recent podcast. He thinks space will be the most economical place to run AI within three years due to better solar power. Dwarkesh Patel pointed out that energy is only a small fraction of data center costs. It is an interesting story, but it does not change today's model choice. If you need to pick between these two, use Opus 4.6 for deep research and large-document reasoning, and GPT-5.3-Codex for large codebases and security reviews.