OpenAI's seventh Codex is a model: GPT-5-Codex (low/medium/high) lands as the default brain inside Codex

OpenAI just shipped another Codex. This time, it's the model itself. GPT-5-Codex now powers Codex cloud tasks and code review by default, and you can pick it for local work in the Codex CLI and IDE extension. It's tuned for agentic coding, with an effort dial at low, medium, or high that trades speed for deeper reasoning.

The seventh thing named Codex, now the default brain

Here's the context for anyone trying to keep the naming straight. There are now seven different things called Codex:

  • Codex platform in ChatGPT
  • Codex-1 model
  • Codex-1-Mini model
  • Codex CLI
  • Original 2021 Codex model
  • Codex IDE extension
  • GPT-5-Codex (low/medium/high)

Yes, that's a lot of Codex. The point: GPT-5-Codex is the new model at the center of the stack, not just autocomplete. It's trained on real software work: building projects, adding features and tests, debugging, large refactors, and code review.

What GPT-5-Codex is built to do

GPT-5-Codex is a GPT-5 variant specialized for agentic coding inside Codex environments. Instead of spending tokens on chatty filler, it's built to spend more cycles when a task gets hard and to breeze through the easy stuff. OpenAI's examples say bottom-decile turns use about 93.7% fewer tokens than GPT-5, while top-decile turns spend roughly 2× longer on planning, editing, testing, and iteration. That heads-down behavior shows up on big refactors and multi-step fixes, where it keeps going until tests pass. OpenAI reports runs that lasted more than seven hours during testing.

Chart: relative token use (illustrative). Easy turns compress to ~6.3 vs GPT-5 at 100; hard turns expand to ~200 for deeper reasoning and testing.

How it performs on refactors and real issues

OpenAI's internal refactor evaluation shows a lift from 33.9% (GPT-5 high) to 51.3% (GPT-5-Codex high). That's a double-digit jump on a task developers do constantly: reshaping existing code while keeping behavior and tests intact. On SWE-bench Verified it posts a smaller improvement over GPT-5. Treat all of this as promising, with outside replication still pending. I care more about whether it can grind through a messy multi-module change with reliable test outcomes than a single static score, but the refactor bump is exactly the area where a coding agent should show value.

Chart: refactor accuracy lift on a proprietary OpenAI evaluation. Independent replication is still pending.

Autonomy and effort control

OpenAI's claim is that GPT-5-Codex can settle into long runs on complex tasks and keep iterating until tests pass. That kind of persistence matters on refactors, migrations, and cross-cutting fixes. The key control you have is effort: low, medium, or high. In Codex tools this is not a style toggle; it affects how long the agent plans, edits, runs code, and re-tests before it returns.

  • Low: tiny edits, quick file ops, simple shell tasks.
  • Medium: most day-to-day feature work and straightforward bug fixes.
  • High: multi-module refactors, migrations, and gnarly debugging where failing tests are your compass.

If you want speed on tap for minor changes and staying power for larger jobs, this is the knob you will touch the most.
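
To make the dial concrete, here's a minimal TypeScript sketch of the triage rule I'd apply before kicking off a task. To be clear, `EffortLevel`, `TaskHints`, and `pickEffort` are my own illustrative names, not a Codex API; you set effort in the Codex tooling itself, and this just encodes the rule of thumb above.

```typescript
// Illustrative heuristic only -- not a Codex API.
type EffortLevel = "low" | "medium" | "high";

interface TaskHints {
  filesTouched: number;     // rough estimate of blast radius
  crossesModules: boolean;  // refactor / migration territory
  hasFailingTests: boolean; // deep-debugging signal
}

function pickEffort(task: TaskHints): EffortLevel {
  if (task.crossesModules || task.hasFailingTests) return "high";
  if (task.filesTouched > 1) return "medium";
  return "low"; // tiny edits, quick file ops, simple shell tasks
}

// Example: a three-file bug fix with a failing test lands on "high".
console.log(pickEffort({ filesTouched: 3, crossesModules: false, hasFailingTests: true }));
```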

What changes in code review

GPT-5-Codex is trained to act as an additional reviewer across a whole repo, not just a single file diff. It can navigate dependencies, run code and tests, and file comments that target the issues that matter. The point is not to replace a human reviewer. The value is coverage: catching logic errors, performance footguns, security gotchas, and consistency problems that slip through late in the week.

If you use it this way, prompt discipline matters. Avoid giant magic prompts and wild instruction dumps. They add noise and increase the odds of the model reacting to stray instructions in code or docs. When you want structured, parseable comments or autofixes, specify the format up front. The structured output guideline is basic but powerful; see the guidance on structured outputs at bridgemind.ai and a useful primer on system and format prompts on medium.com. For teams new to prompt specificity, a short reminder from linkedin.com applies here: say the format you want and the medium you're in, instead of leaving it implied.

Front-end and visual context

The model is stronger on front-end tasks and mobile sites. In cloud runs, it can look at your screenshots, spin up a browser to inspect its own work, and attach screenshots back to tasks and PRs. That makes design intent much clearer for the agent. If you hand it a design image plus component constraints, you tend to get tighter CSS, fewer layout wobbles, and faster iteration because it can compare its output to the screenshot rather than guessing.

Where you can use it today

  • Default in Codex cloud tasks
  • Default for code review
  • Selectable in Codex CLI and the IDE extension
  • Included with ChatGPT Plus, Pro, Edu, Business, and Enterprise seats

It is not a general API model yet. For CLI users who call models via an API key, OpenAI says GPT-5-Codex is coming to the API soon. Today, you'll reach it through Codex tooling.

Practical playbook for teams

1) Treat it as an additional reviewer

Attach GPT-5-Codex to PRs to catch missed errors and enforce standards. Keep a short rubric file in the repo so the agent has a clear policy on tests, performance budgets, and security rules. For structured feedback, ask for JSON objects containing severity, location, message, evidence, and suggested_fix. This maps cleanly to autofix tooling and makes triage easier; the structured output tip is a consistent win per bridgemind.ai and medium.com.
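
As a concrete starting point, here's a minimal sketch of that comment shape in TypeScript. The `ReviewComment` name and field types are my assumptions layered on the rubric above, not an official Codex schema; adapt them to whatever your autofix tooling actually expects.

```typescript
// Hypothetical shape for structured review comments -- field names mirror
// the rubric above (severity, location, message, evidence, suggested_fix).
interface ReviewComment {
  severity: "info" | "warning" | "error";
  location: { file: string; line: number };
  message: string;        // what is wrong, in one sentence
  evidence: string;       // the snippet or test output that supports the finding
  suggested_fix?: string; // optional patch text or instruction for autofix tooling
}

// Example of what you would ask the agent to emit per finding:
const example: ReviewComment = {
  severity: "warning",
  location: { file: "src/cache.ts", line: 42 },
  message: "Cache key omits the tenant ID, so tenants can read each other's entries.",
  evidence: "makeKey(userId) builds the key with no tenant scope.",
  suggested_fix: "Include tenantId in makeKey so entries are scoped per tenant.",
};
```

One object per finding keeps triage trivial: sort by severity, group by file, and feed `suggested_fix` straight into whatever autofix pipeline you run.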

2) Use effort control intentionally

Pick low for quick taps. Use medium for most day-to-day changes. Reserve high for refactors, migrations, and deep debugging where you want the agent to run tests repeatedly and keep iterating. This is the main behavioral knob you'll care about in practice.

3) Keep prompts short and explicit

Skip mega prompts. They bloat tokens and increase the odds of instruction confusion when the model reads code or docs that include command-like strings. Keep prompts crisp, and provide a compact spec instead of prose. A simple rule that mirrors advice from linkedin.com: specify the output format and the tool context.
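
For example, a compact spec can be this short (TypeScript for consistency with the other snippets; the wording and the reference to the `ReviewComment` shape above are illustrative, not an official Codex prompt format):

```typescript
// A minimal sketch of a compact spec prompt: scope, task, and output
// format stated explicitly, nothing else.
const reviewPrompt = `
Task: review this diff for logic errors and security issues only.
Scope: src/auth/** (ignore generated files).
Output: one JSON object per finding matching the ReviewComment shape
(severity, location, message, evidence, suggested_fix). No prose.
`.trim();
```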

4) Front-end: give it visuals

Attach the screenshot of the expected view and let the agent run in cloud with a browser. Youll get better layout accuracy and fewer back-and-forth cycles because it can compare output to the screenshot. When possible, include a short component catalog or design token reference.

5) Calibrate trust, not just accuracy

Benchmarks are helpful, but confidence calibration matters in code changes. If you want a deeper dive on why calibration beats raw percent-correct for production use, I wrote about this in ConfidenceBench: Calibrating LLM Confidence, Not Just Accuracy.

Availability, naming clarity, and how to get started

Access today is through Codex tools. If you are already using the Codex platform in ChatGPT, you are on the path of least resistance. For local work, the Codex CLI and the Codex IDE extension expose GPT-5-Codex as a selectable option. If you are still wrapping your head around OpenAIs naming choices, I covered some of that earlier in OpenAI Codex IDE Extension: When AI Coding Meets Confusing Product Names.

Why this matters

Codex keeps pulling more of the coding loop into OpenAIs stack: interactive pairing for small tasks, sustained autonomous work for major refactors, and code-review coverage in GitHub. GPT-5-Codex is the model that ties those modes together. It shifts its effort based on task complexity, which is exactly what you want from an agent inside a dev environment.

Caveats

  • Availability: it's inside Codex today, not the general API.
  • Benchmarks: refactor lift is notable, SWE-bench gains are modest, and independent replication is still pending.
  • Prompting: sloppy prompts will give you sloppy outcomes; keep them short and specify output formats.

What Id do on day one

  1. Enable GPT-5-Codex for PRs as an additional reviewer with a short JSON schema for comments.
  2. Adopt effort levels: low for trivial tasks, medium as the default, high for refactors and deep debugging.
  3. For front-end work, attach screenshots, run in cloud, and let it attach screenshots to the PR.
  4. Pick one high-impact refactor that has good tests and measure cycle time and review fixes.

Bottom line: this is a better coding agent model inside Codex, especially for refactors and long-running tasks. If you are already in the Codex ecosystem, treat GPT-5-Codex as the default choice and tune effort by task.
