GPT-5.2-Codex: Better Long-Horizon Agentic Coding, Bigger Diffs, and Stronger Defensive Security

OpenAI released GPT-5.2-Codex, a specialized version of GPT-5.2 tuned for agentic coding in Codex. This is aimed at the work that burns the most engineering hours: long sessions in big repos, big diffs like refactors and migrations, reliable tool use, and defensive security workflows where you need the model to keep its footing across many steps.

  • Available now: paid ChatGPT users in all Codex surfaces.
  • Next: API access planned in the coming weeks.
  • Also: invite-only trusted access pilots for vetted defensive cybersecurity professionals and organizations.

What OpenAI is optimizing for this time

Most coding model marketing is still centered on “write code from a prompt.” That’s not the hard part anymore. The hard part is agent behavior: staying consistent across hours, using tools without breaking the environment, handling failure loops, and doing large repo work without drifting into nonsense.

GPT-5.2-Codex is positioned around a few specific improvements that map cleanly to real workflows:

  • Long-horizon tasks via context compaction: keeping the important parts of the session available without hauling around every token forever.
  • Large code changes: better performance on refactors and migrations where you touch lots of files and correctness depends on not missing edge cases.
  • Windows environments: stronger agentic behavior in native Windows setups.
  • Vision for dev work: better at reading screenshots, diagrams, charts, and UI surfaces.
  • Reliability basics: long-context understanding, more dependable tool calling, and better factuality.

Benchmarks that at least resemble real agent work

OpenAI highlights state-of-the-art performance on SWE-Bench Pro and Terminal-Bench 2.0. These are closer to “can you ship a patch in a real repo” and “can you operate inside a terminal” than the usual one-file puzzle tests.

  • SWE-Bench Pro: given a repository and a realistic engineering task, generate a patch that solves it.
  • Terminal-Bench 2.0: agent tasks in terminal environments like compiling code, setting up servers, and running training jobs.

They also cite a related improvement on SWE-Bench Verified for GPT-5.2 variants: 80.0% versus 76.3% for GPT-5.1. That number is not the whole story, but it matches what a lot of teams want: fewer “almost correct” patches that die on integration, tests, or environment setup.
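The raw delta understates what changed if you frame it in failure terms; a quick sanity check of the arithmetic from the reported numbers:

```python
# Reported SWE-Bench Verified scores from the announcement.
gpt_5_1 = 76.3  # percent of tasks solved
gpt_5_2 = 80.0  # percent of tasks solved

# Absolute improvement in solve rate.
delta = gpt_5_2 - gpt_5_1  # 3.7 points

# The more interesting framing: how much the failure rate shrank.
fail_before = 100 - gpt_5_1  # 23.7% of tasks unsolved
fail_after = 100 - gpt_5_2   # 20.0% of tasks unsolved
relative_reduction = (fail_before - fail_after) / fail_before

print(f"{delta:.1f} point gain, {relative_reduction:.1%} fewer failures")
```

A 3.7-point gain is roughly a 15.6% reduction in unsolved tasks, which is the framing that matters if your pain point is patches that almost land.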

Bar chart: reported SWE-Bench Verified improvement for GPT-5.2 variants (80.0) versus GPT-5.1 variants (76.3).

Why context compaction is the core feature

If you’ve used coding agents beyond quick patches, you’ve seen the drift pattern:

  • It forgets an earlier constraint and reintroduces a bug it already fixed.
  • It keeps re-reading files it already summarized because the conversation is too long.
  • It changes direction mid-task because the plan got buried.

Native compaction is an attempt to make long sessions stable. The goal is not “stuff more tokens in,” it’s “keep the right details.” OpenAI also mentions a 256k token context with dynamic compaction, which implies the model can keep operating over long histories while compressing older state into something cheaper and more usable.
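As a rough mental model, and emphatically not OpenAI's actual implementation, compaction looks something like this: once the transcript exceeds a budget, the middle of the history is collapsed into a summary while the original objective and the most recent turns stay verbatim.

```python
def compact(history, budget=8, keep_recent=4, summarize=None):
    """Toy context compaction: keep the first message (the objective)
    and the most recent turns verbatim; collapse everything in between
    into a single summary entry. `summarize` stands in for whatever a
    real system would use (here, naive truncation and joining)."""
    if len(history) <= budget:
        return history
    summarize = summarize or (lambda msgs: "SUMMARY: " + "; ".join(m[:40] for m in msgs))
    head, middle, tail = history[0], history[1:-keep_recent], history[-keep_recent:]
    return [head, summarize(middle)] + tail

session = [f"turn {i}" for i in range(20)]
compacted = compact(session)
print(len(compacted))  # 6: objective + one summary + 4 recent turns
```

The point of the sketch is the invariant, not the mechanics: whatever gets compressed, the objective and the live working state must survive, or you get the drift pattern described above.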

In practical terms, this is the difference between a model that can:

  • start a migration,
  • hit a failing test wall,
  • roll back part of the approach,
  • and still remember the original objective and constraints.

Big diffs: refactors and migrations

OpenAI is explicitly claiming stronger performance on large code changes, not just incremental edits. That matters because most “agent coding” failures don’t happen when writing a new file. They happen when the agent has to change ten files in a coordinated way and then chase second-order breakage.

If you want to use GPT-5.2-Codex in a way that fits that claim, give it work that has a clear boundary and a clear verification loop:

  • Refactor a module boundary and update all imports and call sites.
  • Migrate a deprecated library across the repo and update tests.
  • Convert a set of endpoints to a new auth middleware and verify via integration tests.

The prompt pattern that works best for this style of work is boring but effective:

  • state the objective,
  • state the constraints,
  • state the acceptance criteria,
  • and require a checkpoint plan before changes begin.
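A minimal sketch of that pattern as a reusable template (the section names and helper are my own convention, not anything Codex requires):

```python
def agent_task_prompt(objective, constraints, acceptance, require_plan=True):
    """Assemble a structured task prompt: objective, constraints,
    acceptance criteria, and an explicit checkpoint-plan requirement."""
    lines = [f"Objective: {objective}", "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", "Acceptance criteria:"]
    lines += [f"- {a}" for a in acceptance]
    if require_plan:
        lines += ["", "Before making any changes, post a checkpoint plan "
                      "and wait for confirmation."]
    return "\n".join(lines)

prompt = agent_task_prompt(
    objective="Migrate all endpoints to the new auth middleware",
    constraints=["No public API changes", "Keep commits scoped per endpoint"],
    acceptance=["All integration tests pass", "No deprecated imports remain"],
)
print(prompt)
```

However you feed this to the agent, the value is that the objective and acceptance criteria are stated once, up front, where compaction is most likely to preserve them.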

Windows support: not exciting, very useful

A lot of agent tooling assumes a Linux dev environment. Plenty of companies do not. OpenAI calls out improved performance in native Windows environments, building on what they introduced in GPT-5.1-Codex-Max.

If your team is stuck on Windows for corporate reasons, this is the difference between “we can try it” and “we can standardize it.” A model that can navigate PowerShell quirks, path conventions, and Windows-native build steps saves time that usually gets wasted on environment mismatch.

Vision improvements are a practical developer feature

OpenAI claims stronger vision performance for screenshots, technical diagrams, charts, and UI surfaces. They also mention large reductions in errors on chart reasoning and software interfaces.

This tends to show up in three real workflows:

  • Design mock to prototype: turning UI mocks into components with less guessing.
  • Bug reports with screenshots: mapping a visual regression to the most likely CSS or layout culprit.
  • System diagrams: translating architecture diagrams into code boundaries and tickets.

Cybersecurity: stronger capability, more dual-use exposure

OpenAI says GPT-5.2-Codex has stronger cybersecurity capabilities than any model they’ve released so far, with sharp jumps on evaluations such as Professional Capture-the-Flag. They also say it does not reach their “High” threshold under the Preparedness Framework, but they’re planning deployment as if future models will cross that line.

Line chart: reported progress on a core cybersecurity evaluation, from 27 to 76 across GPT-5-Codex, GPT-5.1-Codex-Max, and GPT-5.2-Codex.

The real-world story OpenAI points to is a good example of what “defensive acceleration” looks like in practice. Security researcher Andrew MacPherson used GPT-5.1-Codex-Max with Codex CLI while studying React2Shell, a React Server Components vulnerability tracked as CVE-2025-55182. While reproducing and probing the system with standard defensive workflows like environment setup and fuzzing, he found unexpected behaviors that led to three additional vulnerabilities, which were responsibly disclosed and later published by the React team on December 11, 2025.

That’s a defender workflow: reproduce, set up a harness, test assumptions, fuzz, validate, disclose. It’s also exactly why OpenAI is pairing these capability gains with more safeguards and staged access. Better capability helps defenders move faster, and it also lowers the barrier for misuse.

Access and a practical CLI starting point

If you are already using Codex, GPT-5.2-Codex is available now for paid ChatGPT users across Codex surfaces. OpenAI says API access is coming soon, and they’re also piloting invite-only trusted access for vetted defensive security work.

If you use Codex CLI, the release report notes you can point it at GPT-5.2. Example:

codex -m gpt-5.2

Where this fits in the broader OpenAI release cycle is consistent with what I wrote during the base model launch: these updates compound value when you’re using the tool daily. If you want that context, see GPT-5.2 Is Live: Why $20/Month Is the Best Deal in Tech Right Now. And if you’re thinking about how this lands inside companies, the bigger shift is agents moving from “nice demo” to “part of the stack,” which I covered here: Enterprise AI Adoption in 2025: From Casual Chat to Core Infrastructure.

My take

I’m sticking with my current workflow: Opus 4.5 for everything, because I can get pretty much unlimited Opus 4.5 access through the $20 Google subscription. That’s a much better deal than subscribing to Anthropic directly.

Opus 4.5 is extremely capable. The only downside is that this route ties you to Antigravity, which isn’t the best tool, but it’s still better than any other tool paired with any other model. Sometimes I’ll use Gemini 3 to clean up the UI afterward because I like the way it handles UI.

Beyond that, Opus is king, and GPT-5.2-Codex doesn’t change that. In fact, it regresses on MLE-Bench, which measures machine learning engineering tasks. The one test I ran with it produced a really terrible UI for a tool that didn’t work. So this model is nothing special. Maybe it will solve some issues other models can’t, but nobody’s default is moving off Opus 4.5.