[Cover image: the word 'SPARK' in bold black sans-serif type on a white background, with a small speedometer icon to the right.]

GPT-5.3-Codex-Spark: 1000 Tokens Per Second, But Is It Actually Faster?

On February 12, 2026, OpenAI released GPT-5.3-Codex-Spark, a smaller version of GPT-5.3-Codex built for real-time coding. The headline number is over 1000 tokens per second, served on Cerebras Wafer Scale Engine 3 hardware. That’s roughly 15x faster than the flagship GPT-5.3-Codex, which runs at around 65-70 tokens per second. It’s available as a research preview for ChatGPT Pro users in the Codex app, CLI, and VS Code extension.

This is the first product of the OpenAI-Cerebras partnership that was announced in January 2026. The model has a 128k context window, is text-only, and is designed for interactive coding work: targeted edits, refining interfaces, reshaping logic, and getting near-instant responses while you iterate. It doesn’t auto-run tests unless asked, and it defaults to a lightweight working style with minimal edits.

The Speed Numbers Look Great on Paper

Beyond the raw token speed on Cerebras, OpenAI also rolled out pipeline-wide latency improvements: a persistent WebSocket connection, an 80% reduction in client/server roundtrip overhead, a 30% cut in per-token overhead, and a 50% improvement in time-to-first-token. These changes are enabled by default for Codex-Spark now and will become the default for all Codex models soon.
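To see why those percentages compound into a real difference, here is a back-of-envelope latency model. Only the stated reductions (80% roundtrip, 30% per-token overhead, 50% time-to-first-token) come from the announcement; the baseline figures are hypothetical placeholders I picked for illustration.

```python
# Rough model of one model response: time-to-first-token, raw decoding,
# client-side per-token overhead, and request/response roundtrips.
# Baseline numbers below are hypothetical; only the percentage cuts
# come from OpenAI's stated improvements.

def response_time(ttft, per_token_overhead, roundtrip, n_tokens, n_roundtrips, tps):
    """Approximate wall-clock time for one response, in seconds."""
    generation = n_tokens / tps                 # raw decode time
    overhead = n_tokens * per_token_overhead    # per-token pipeline cost
    network = n_roundtrips * roundtrip          # connection roundtrips
    return ttft + generation + overhead + network

# Hypothetical baseline: 1.0 s TTFT, 2 ms/token overhead, 150 ms roundtrips.
before = response_time(1.0, 0.002, 0.150, n_tokens=800, n_roundtrips=6, tps=70)

# After the improvements: TTFT -50%, per-token overhead -30%,
# roundtrip cost -80% (persistent WebSocket instead of fresh connections).
after = response_time(0.5, 0.0014, 0.030, n_tokens=800, n_roundtrips=6, tps=70)

print(f"before: {before:.2f}s, after: {after:.2f}s")
```

Under these made-up baselines the decode time dominates, which is exactly why the overhead cuts matter most for a fast model like Spark: the faster the raw generation, the larger the share of total latency that roundtrips and per-token overhead represent.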

[Chart: Pipeline Latency Reductions. Infrastructure improvements shipping with Codex-Spark, coming to all models soon.]

Benchmarks: Fast, But Not as Capable

On benchmarks, Codex-Spark trades accuracy for speed, and the trade-off is visible. On SWE-Bench Pro it scores around 56%, essentially matching GPT-5.3-Codex’s 56.8%. But on Terminal-Bench 2.0 the gap widens: 58.4% for Spark versus 77.3% for the flagship. It still comfortably beats GPT-5.1-Codex-mini’s 46.1% on Terminal-Bench, so it’s not a weak model by any stretch. It’s just not the frontier.

[Chart: Benchmark Accuracy Comparison. Codex-Spark nearly matches the flagship on SWE-Bench Pro but falls behind on Terminal-Bench 2.0.]

The Actual Problem: Token-Happy and Tool-Happy

Here’s where the marketing and reality diverge. I’m on the $20 plan, not the $200 Pro plan, so I can’t test Codex-Spark myself. But the reviews coming in tell a consistent story: the model is way too aggressive with tool calls and token usage. GPT-5.3-Codex is surgical. It uses the tools it needs, completes the task efficiently, and moves on. Codex-Spark, despite its faster inference speed, calls far more tools than necessary, generates more tokens than it should, and often ends up being slower at completing the actual task than the flagship model it’s supposed to complement.

OpenAI demoed it building a snake game. Models have been able to build a snake game for years. That tells you nothing about how it handles real codebases with complex dependencies, multi-file refactors, or the kind of software engineering work people are actually using Codex for in 2026. On those tasks, the excess tool calls and verbose output eat into whatever time savings the raw inference speed provides.
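The arithmetic behind "faster per token but slower at the task" is worth making explicit. Every number in this sketch is a hypothetical illustration, not a measurement; it just shows how extra tokens and tool calls can swamp a 15x decode-speed advantage.

```python
# Why 15x faster tokens can still mean a slower task. All task-level
# figures here (token counts, tool-call counts, tool-call cost) are
# hypothetical illustrations, not benchmarks.

def task_time(tokens_generated, tps, tool_calls, seconds_per_tool_call):
    """Wall-clock time to finish a task: decoding plus tool-call waits."""
    return tokens_generated / tps + tool_calls * seconds_per_tool_call

# Flagship: ~67 tok/s, surgical -- few tokens, few tool calls.
flagship = task_time(tokens_generated=4_000, tps=67,
                     tool_calls=8, seconds_per_tool_call=3.0)

# Spark: ~1000 tok/s, but verbose and tool-happy.
spark = task_time(tokens_generated=20_000, tps=1_000,
                  tool_calls=40, seconds_per_tool_call=3.0)

print(f"flagship: {flagship:.0f}s, spark: {spark:.0f}s")
```

With these assumed numbers the flagship finishes first, because tool-call waits are paid at wall-clock speed regardless of how fast the model decodes. That is the disconnect reviewers are describing.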

Where It Does Work

Not all the feedback is negative. For research-style tasks, quick lookups in a codebase, generating docstrings, basic refactoring, or any task where you don’t need frontier-level precision, Codex-Spark’s speed advantage is real and useful. If you’re doing a lot of lightweight, iterative work where the model’s intelligence ceiling doesn’t matter much, 1000 tokens per second makes a noticeable difference in how the interaction feels.

The problem is that for the complex tasks where Codex shines most, where you want it to carefully reason through a problem and make clean, minimal changes, the Spark variant is worse and sometimes not even faster. That’s a tough sell for a speed-focused model.

The Cerebras Partnership and What’s Next

The Cerebras integration is interesting from an infrastructure perspective. GPUs remain foundational for OpenAI’s training and inference pipelines, but Cerebras’ Wafer Scale Engine 3 fills a niche for workflows demanding extremely low latency. OpenAI says GPUs and Cerebras can be combined for single workloads to hit the best performance, which suggests this is more of a complement than a replacement for their existing GPU fleet.

OpenAI’s roadmap for this line includes larger models, longer context lengths, and multimodal input. The vision is a Codex that blends real-time interactive mode with long-horizon autonomous agents: you stay in a tight loop with Spark while it delegates heavier tasks to sub-agents running the full GPT-5.3-Codex in the background. That’s a compelling direction. The question is whether they can make Spark’s real-time mode reliable and efficient enough that developers actually want to use it instead of just waiting for the flagship model to finish.

If you’re interested in how GPT-5.3-Codex stacks up against other frontier coding models, I covered that in my comparison of Claude Opus 4.6 vs GPT-5.3-Codex and in the broader 2026 LLM rankings.

Availability Details

Codex-Spark is available now for ChatGPT Pro users at $200/month. It has its own separate rate limits during the research preview, and usage doesn’t count against standard limits, though you might hit queuing during high demand. There’s also limited API access for design partners, with broader access planned in the coming weeks. Safety evaluations confirmed it doesn’t reach the Preparedness Framework threshold for high capability in cybersecurity or biology.

Bottom Line

GPT-5.3-Codex-Spark is a fast model. That part is true. But fast inference doesn’t always mean fast task completion, and that’s the disconnect right now. The model’s tendency to over-call tools and generate excessive tokens undermines its speed advantage on anything beyond basic tasks. For research, quick lookups, and lightweight coding, it’s a nice option to have. For real software engineering work, GPT-5.3-Codex is still the one to use, and it often finishes the job faster despite being 15x slower per token. Speed means nothing if the model takes a longer path to get there.