
GPT-5.1-Codex-Max xhigh: Strong Agentic Coder, Horrible Name

Just when I thought it could not get worse, OpenAI shipped a model literally called GPT-5.1-Codex-Max xhigh. The name is absurd. The model itself is not.

This is a clear upgrade to the Codex line: better long-horizon coding, smarter use of tokens, and a new reasoning effort that just means spend more compute thinking. No mysticism, just a stronger agentic coder wrapped in a naming scheme that looks like it failed the Neal.fun password game on rule five.

How many things are called Codex now

I went through the latest announcement and counted. Even without treating reasoning efforts as separate models, we are now at 13 different things that carry the Codex name across products and models:

  • Codex (legacy model)
  • Codex 1
  • Codex 1 Mini
  • GPT-5-Codex
  • GPT-5-Codex-Mini
  • GPT-5.1-Codex
  • GPT-5.1-Codex-Mini
  • GPT-5.1-Codex-Max
  • Codex SDK
  • Codex CLI
  • Codex IDE extension
  • Codex Cloud
  • Codex Legacy API

This is what happens when model companies treat naming as an afterthought. The underlying products can be good and still be wrapped in alphabet soup. I have said before that they might as well let the models name themselves; it could hardly be worse.

What GPT-5.1-Codex-Max actually is

Under the chaos branding, GPT-5.1-Codex-Max is a frontier agentic coding model built on OpenAI’s updated reasoning base. It is trained on real software engineering work: pull request creation, code review, frontend tasks, Q&A, long-horizon problem solving, and now Windows environments as well.

The goals are straightforward:

  • Handle long-running, project-scale coding tasks without falling apart.
  • Reason more efficiently so you spend fewer tokens for the same or better result.
  • Behave like a reliable partner inside the Codex tools, not just a generic chat model that sometimes writes code.

Unlike GPT-5.1, which is a general model, GPT-5.1-Codex-Max is tuned specifically for agentic coding in Codex-style environments. It is the version OpenAI wants you to use whenever an AI engineer is running tools, tests, refactors, and multi-step workflows rather than just answering one-off coding questions.

Compaction: how it stretches context

The headline feature is what they call compaction. GPT-5.1-Codex-Max is trained to work across multiple context windows by periodically summarizing its own history and pruning the unimportant parts. The point is to stay coherent over millions of tokens of work.

In Codex, when a session approaches the context limit, the model compacts the history into a smaller representation, gets a fresh window, and keeps going. It repeats this until the task is done. That enables things like:

  • Refactoring an entire large repository instead of a single file or small slice of the codebase.
  • Running multi-hour agent loops that keep fixing failing tests and nudging the system toward a green run.
  • Deep debugging sessions where the system needs to remember a long trail of attempts, logs, and partial fixes.
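OpenAI has not published how compaction works internally, but the described behavior can be sketched as a simple loop: estimate the session's token footprint, and when it nears the window limit, fold the older history into a summary and keep the recent steps verbatim. Everything below is an illustrative assumption, not the actual Codex mechanism: the 4-characters-per-token estimate, the window size, the threshold, and the stub summarizer (which in a real agent would be another model call).

```python
# Illustrative sketch of a compaction loop, NOT OpenAI's implementation.

CONTEXT_LIMIT_TOKENS = 400_000   # hypothetical window size
COMPACTION_THRESHOLD = 0.8       # compact at 80% of the window

def estimate_tokens(messages: list[str]) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(m) for m in messages) // 4

def summarize(messages: list[str]) -> str:
    # Stub: a real agent would ask the model to compress its own history,
    # keeping goals, decisions, and open problems while dropping raw logs.
    return f"[summary of {len(messages)} earlier steps]"

def append_with_compaction(history: list[str], new_message: str) -> list[str]:
    history = history + [new_message]
    if estimate_tokens(history) > CONTEXT_LIMIT_TOKENS * COMPACTION_THRESHOLD:
        # Keep the most recent steps verbatim; fold the rest into a summary.
        recent, older = history[-10:], history[:-10]
        history = [summarize(older)] + recent
    return history
```

The interesting design decision is what survives compaction: decisions and open problems should make the cut, raw tool output mostly should not.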

This lines up with the question I wrote about in When Does a Chatbot Become an Agent – once you can run for hours, touch a full codebase, and keep your own working memory synced, you are no longer just a chat interface with code snippets.

Benchmarks and token efficiency

On OpenAI’s own numbers, GPT-5.1-Codex-Max beats GPT-5.1-Codex on frontier coding benchmarks while using around 30% fewer thinking tokens at the same reasoning level. They highlight three benchmarks in particular:

  • SWE-Bench Verified
  • SWE-Lancer IC SWE
  • Terminal-Bench 2.0

Benchmark results from OpenAI show GPT-5.1-Codex-Max ahead of GPT-5.1-Codex on all three software engineering tests, while using fewer thinking tokens at the same reasoning level.

What matters for real workloads is that you can get stronger reasoning without paying linearly more in tokens. I have already written about how structure and formats affect token use in TOON vs JSON for LLMs; adaptive reasoning is the same story at the thinking level. Give the model enough budget to think, but not so much that you burn money for marginal gains.

xhigh: extra high reasoning, Snoop Dogg edition

OpenAI also added a new Extra High reasoning effort, written as xhigh. That simply means: spend even more compute to think longer for tasks where latency does not matter.

Somewhere out there, Snoop Dogg is nodding. OpenAI built a setting whose entire job is to think extra high.

Their own recommendation is still to use medium as the daily driver, which lines up with how I treat these knobs in general: minimal or low for interactive tools, medium for most serious work, xhigh only when you know the task is gnarly enough to justify the bill.
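In an OpenAI-style API, effort is just a field on the request. The payload below is a sketch of what that looks like; the model slug and the exact shape of the reasoning field are assumptions based on the announcement, not a verified API reference, so check your client library before copying it.

```python
# Hypothetical request payload; field names are assumptions, not official docs.
request = {
    "model": "gpt-5.1-codex-max",
    "reasoning": {"effort": "xhigh"},  # reserve for gnarly, latency-insensitive work
    "input": "Migrate the test suite off the deprecated fixtures and get it green.",
}

# Daily-driver variant, per OpenAI's own recommendation: medium.
default_request = {**request, "reasoning": {"effort": "medium"}}
```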

This is the same pattern I wrote about in the MIT SEAL piece, Why Most Self-Improving AI Papers Are Just More Compute. A lot of these advances are not brand new ideas; they are smarter ways to spend more tokens where it actually helps.

Long-running agents, cybersecurity, and review

GPT-5.1-Codex-Max is designed to stay on task for hours, even over 24 hours in OpenAI’s internal runs. It will keep iterating, running tests, fixing failures, and pushing toward a green test suite.

That crosses directly into agent territory, which brings risk if you treat it as fully trusted. OpenAI highlights a few guardrails:

  • Codex runs in a sandbox with file writes limited to its workspace by default.
  • Network access is off unless you turn it on, to reduce prompt-injection from untrusted content.
  • They rate GPT-5.1-Codex-Max below their highest internal tier for cybersecurity, but it is still the strongest cyber model they have deployed so far.
  • They are preparing additional safeguards, including through their Aardvark program, so that defenders benefit from the same capability gains.

For developers, the important part is simple: treat Codex as an extra reviewer, not the final boss. OpenAI itself says that even with better code review tools, humans should still review changes before deployment. I covered this mindset in AI Errors vs Human Errors: You are Choosing Which Mistakes You Want. The agent will make different mistakes than your team, not fewer mistakes in every situation.

How OpenAI says it uses Codex internally

One detail from the announcement that stands out is how heavily OpenAI claims its own engineers lean on Codex. According to them, around 95 percent of OpenAI engineers use Codex weekly, and those engineers ship roughly 70 percent more pull requests since adopting Codex.

You can treat those numbers as marketing, but they align with the general pattern: a good coding agent does not replace your engineers, it changes where their time goes. Less time wiring boilerplate, more time on decisions and review. That is very similar to what I see in other agentic tools I have covered, like the Grok-based agents in Sherlock Dash Alpha and Sherlock Think Alpha.

What to actually use now

If you build on OpenRouter or anything similar, my routing advice now looks like this:

  • GPT-5.1-Codex-Max for serious, long-running engineering tasks in Codex-like environments.
  • GPT-5.1-Codex-Mini for cheaper day-to-day edits, smaller refactors, and quick coding help.
  • GPT-5.1 for high-value mixed reasoning that is not primarily code.
  • Treat older Codex models as legacy unless you have a very specific reason to keep them.

On reasoning efforts:

  • Use minimal or low for chatty, latency-sensitive tools.
  • Use medium by default for engineering work.
  • Reserve xhigh for the tasks where you would otherwise be tempted to babysit the model for hours anyway.
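The two lists above collapse into a small routing helper. The model slugs and the task taxonomy here are my assumptions for illustration; swap in whatever identifiers your provider actually exposes.

```python
# Illustrative router; model slugs and task labels are assumptions, not official IDs.

def pick_model(task: str) -> str:
    return {
        "long_engineering": "gpt-5.1-codex-max",   # serious, long-running agent work
        "quick_edit": "gpt-5.1-codex-mini",        # cheap day-to-day edits
        "mixed_reasoning": "gpt-5.1",              # high-value work that is not mostly code
    }.get(task, "gpt-5.1-codex-mini")              # default to the cheap option

def pick_effort(latency_sensitive: bool, gnarly: bool) -> str:
    if latency_sensitive:
        return "low"      # chatty, interactive tools
    if gnarly:
        return "xhigh"    # tasks you would otherwise babysit for hours
    return "medium"       # sane default for engineering work
```

For a project-scale refactor, `pick_model("long_engineering")` plus `pick_effort(False, True)` gives you the max-model, xhigh combination; an inline editor assistant lands on the mini model at low effort.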

The short version: GPT-5.1-Codex-Max is a strong, practical step for agentic coding. It is not a new universe of AI, it is a better Codex with worse branding. Use it because it makes long projects and complex refactors less painful, not because the name passes any password game.