GPT-5.4 Beats Claude 4.6 Opus in 5D Chess

GPT-5.4 can play 5D chess. It beat Claude 4.6 Opus in a match that ran close to five hours. The models worked from plain-text descriptions of the board state. They selected their own moves without any rules engine or interface. Pick an illegal move and the game ends in your loss.

This match matters because 5D chess demands real long-horizon planning. The game adds time travel and parallel timelines to regular chess. You can move pieces into the past or create branching futures. Checkmate the opponent's king on any board across those timelines and you win. The board count multiplies with each time shift. What starts as one board quickly becomes several connected realities. Each decision must account for effects that ripple backward and sideways through those realities at once.

Even GM Hikaru Nakamura found it tough. The state space explodes. With almost no existing game data available, the models could not rely on memorized patterns. They had to generate workable strategies during the match itself. That points to generalization rather than recall from training data.

The eval avoided hand-holding on purpose. Text input only. Independent move selection. This setup reveals more than padded interfaces that hide gaps in understanding. It forces the models to internalize the rules first and then apply them across extended sequences of decisions. The five-hour length shows both systems held focus far beyond typical short puzzles.
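To make the forfeit rule concrete, here is a minimal sketch of a text-only referee loop. It is not the harness used in the match, and the write-up does not say how illegal moves were detected. The toy stone-taking game, the `render`, `legal_moves` and `run_match` names, and the two stub players are all hypothetical stand-ins. The point is the shape: each player sees only text, returns a move string, and never receives a list of legal options.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy game standing in for 5D chess: players alternately take
# 1 to 3 stones from a pile, and whoever takes the last stone wins.
def render(pile: int) -> str:
    """Plain-text state description, the only thing a player ever sees."""
    return f"There are {pile} stones. Take 1, 2, or 3."

def legal_moves(pile: int) -> List[str]:
    """Used by the referee only; never shown to the players."""
    return [str(n) for n in (1, 2, 3) if n <= pile]

def run_match(players: Dict[str, Callable[[str], str]], pile: int) -> Tuple[str, str]:
    """Return (winner, reason). Any illegal move forfeits immediately."""
    order = list(players)
    turn = 0
    while True:
        name = order[turn % 2]
        opponent = order[(turn + 1) % 2]
        move = players[name](render(pile)).strip()
        if move not in legal_moves(pile):
            return opponent, f"{name} played illegal move {move!r}"
        pile -= int(move)
        if pile == 0:
            return name, "took the last stone"
        turn += 1

# Stub players standing in for model calls (assumptions, not real APIs).
careful = lambda text: "1"   # always takes one stone
sloppy = lambda text: "4"    # always answers with a move that is never legal

print(run_match({"A": careful, "B": sloppy}, pile=5))
```

Swapping the stubs for real model calls would keep the same contract: text in, one move string out, and no retries.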

Standard Chess Shows the Baseline

Frontier models already handle regular chess at a decent level. Earlier tests put GPT-5.4 ahead of Claude 4.6 Opus on piece captures, board control and decision speed. GPT-5.4 finished with about 59 seconds of total thinking time compared to roughly 10 minutes for Claude. That is about a 10x speed advantage, and it came with stronger results. These outcomes match what we see in benchmarks that test endgames, tactics and complete games against capable opponents.

[Figure: Standard chess performance comparison]

The chart makes the gap clear. Lower time paired with a higher score favors GPT-5.4. Such head-to-head numbers give concrete reference points instead of isolated benchmark scores.
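A quick check of the speed figure, using only the two thinking times quoted above. Both inputs are approximate ("about 59 seconds", "roughly 10 minutes"), so the ratio is approximate too. The variable names are mine.

```python
# Approximate thinking times quoted in the text.
gpt_seconds = 59
claude_seconds = 10 * 60

ratio = claude_seconds / gpt_seconds
print(f"{ratio:.1f}x")  # 10.2x
```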

Mechanics Behind 5D Chess

The game came out in 2020 from Thunkspace. It adds two new axes. One tracks time progression. The other tracks separate timelines that branch when players shift pieces through time. Pieces treat those dimensions the same way they treat ranks and files on a normal board. A move can reach back to change an earlier position or split off a new future. You must defend against threats in any timeline because checkmate on one board ends the entire game.
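One way to picture "pieces treat those dimensions the same way they treat ranks and files" is to count movement directions. The sketch below assumes a rook-like piece slides along exactly one axis and a bishop-like piece along exactly two axes at once, then compares a normal 2-axis board with a 4-axis one. The `directions` helper and the axis ordering are my own. It ignores the game's real constraints, such as which boards are playable and when a new timeline spawns.

```python
from itertools import product
from typing import List, Tuple

def directions(axes_moved: int, n_axes: int = 4) -> List[Tuple[int, ...]]:
    """All unit steps that change exactly `axes_moved` of `n_axes` axes by +1 or -1."""
    return [
        step for step in product((-1, 0, 1), repeat=n_axes)
        if sum(abs(c) for c in step) == axes_moved
    ]

# Axis order is an arbitrary choice for this sketch: (file, rank, time, timeline).
rook_2d, rook_4d = directions(1, n_axes=2), directions(1)
bishop_2d, bishop_4d = directions(2, n_axes=2), directions(2)

print(len(rook_2d), len(rook_4d))      # 4 8
print(len(bishop_2d), len(bishop_4d))  # 4 24
```

The rook-like count doubles and the bishop-like count grows sixfold before a single extra board is even considered.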

That creates massive complexity. Every action carries consequences that spread across multiple histories. Plans must stay consistent even as the number of active boards grows. The game starts simple on one board but scales fast. Holding all that in working memory while searching for good moves tests exactly the kind of sustained reasoning that matters for harder tasks. Models must plan several steps ahead not just on the current board but on boards that do not yet exist or that represent altered pasts.

[Figure: Complexity comparison across chess variants]

This chart illustrates the scale of the jump. The state space grows by orders of magnitude with each added dimension. Standard chess already presents a tree too large for brute force. 5D chess pushes that far past what even strong search can cover without smart pruning and long-term structure recognition.
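A back-of-the-envelope sketch of why a larger branching factor hurts so quickly. The tree-size formula (options per move raised to the search depth) is standard. The numbers are not measurements: 35 is a commonly cited rough average for standard chess, and the 5D figure is a pure placeholder that pretends four active boards each contribute the same number of options.

```python
def leaf_count(branching: int, depth: int) -> int:
    """Number of move sequences of length `depth` with `branching` options per move."""
    return branching ** depth

CHESS_BRANCHING = 35       # rough, commonly cited average for standard chess
HYPOTHETICAL_5D = 35 * 4   # placeholder only, not a measured value for 5D chess

for depth in (2, 4, 8):
    ratio = leaf_count(HYPOTHETICAL_5D, depth) // leaf_count(CHESS_BRANCHING, depth)
    print(depth, ratio)
# 2 16
# 4 256
# 8 65536
```

Even with this modest placeholder, looking eight moves ahead costs tens of thousands of times more than in the standard game.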

What the No Handholding Setup Reveals

By giving only text and requiring valid, self-selected moves, the test removes crutches. Many interfaces quietly validate or suggest options. This version counts any mistake as total loss. That isolates whether the model truly understands the rule interactions or is simply guessing within a guided frame. Both models sustained play for hours, which indicates they tracked the branching states effectively enough to avoid early collapse.

I see this as consistent with how these models actually work. They compress patterns from text including causal relationships that never get spelled out directly. The structure of the world shapes how we describe events. Good prediction of that text gives models an implicit sense of how rules interact. 5D chess forces them to apply that sense to a brand new rule set. The result supports the idea that LLMs carry workable models of cause and effect even in domains far from their training distribution.

This matches what I have observed across other areas. Smaller models trip on precise numerical or logical steps. Frontier versions clear higher bars when the task plays to their strengths in pattern compression and inference. The same pattern appears in planning. They can track long sequences when the rules are given clearly even if the specific domain is unfamiliar. The 5D match supplies one more data point that top models generalize on novel complex systems.

Why the Eval Matters for Progress

Standard benchmarks have grown saturated. Many tasks now risk data contamination or reward memorization over fresh reasoning. 5D chess dodges that problem. Sparse training data on the variant means any coherent strategy has to form on the spot. The five-hour duration shows both models maintained coherence over an unusually long interaction without external scaffolding.

It also aligns with other tests that involve multi-step optimization under constraints. Success here signals movement toward systems that can manage open-ended planning without constant external guidance. GPT-5.4 came out ahead in this head-to-head. That fits its recorded edges on speed and certain coding benchmarks, while Claude 4.6 Opus holds leads on some deep reasoning tests. Unusual contests like this one help sort the practical differences between the two.

The rivalry between the labs keeps delivering these data points. One side gains on coding speed and efficiency metrics. The other pushes certain reasoning metrics higher. Direct comparisons on games with exponential complexity cut past the separate leaderboards. Five hours of play across branching timelines gives a concrete outcome that standard matches no longer provide.

Continued work on evaluations with minimal prior data will sharpen the picture. The current frontier already passes tests that looked unreachable a few generations ago. This 5D match follows that line. It shows top models can absorb complex rule systems from text and generate novel approaches during extended play. Full logs are not available, but the core signal still stands: who won, and how long both models sustained play.

I keep coming back to the same point when I look at results like this. LLMs already carry implicit models of how causes lead to effects. Text encodes those relationships at scale. When you drop a new game with clean rules on top of that foundation, the better models extract what they need and start building plans. The 5D result reinforces exactly that view and suggests these evaluations scale toward more advanced testing as models improve.

Expect more matches of this type. They expose capabilities that standard chess no longer reaches. For anyone tracking real planning progress, this kind of test supplies a clearer signal than many conventional benchmarks now can.

Adam Holter

Founder of Ironwood AI. Writing about AI models, agents, and what's actually happening in the space.