Blueprint-Bench 2 Reveals the First Signs of 3D Spatial Intelligence

Andon Labs released Blueprint-Bench 2. The results show the first measurable signs of 3D spatial intelligence in frontier models on a task that produced nothing but noise seven months ago.

The benchmark works like this. An agent receives around 20 photos taken from inside an apartment. It must output a coherent floor plan that correctly identifies each room, maps the connections between them, and maintains consistent scale throughout. This runs across 50 different apartments. The agent has access to a notepad where it can record observations and improve its approach as it moves from one unit to the next. On the original version, every major model delivered what amounted to random guesses, with outputs that bore no relationship to actual layouts. For reference, the random baseline sat at 0.279 and humans scored 0.547.

The new leaderboard looks different. GPT 5.5 sits in first place. Gemini 3.1 Pro follows, then Claude Opus 4.7. Humans continue to outperform the best models by a clear margin. Most other systems still land below random. The distance between the leaders and the pack remains substantial. Yet the top models now clear the threshold where genuine spatial deductions appear.
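One way to read "below random" and "below human" on a single scale is to rescale a raw score so the random baseline maps to 0 and the human score maps to 1. The sketch below is only an illustration: it assumes the 0.279 and 0.547 figures quoted above share one scoring scale, the function name `normalized` is my own, and the example model scores are hypothetical placeholders, not leaderboard values.

```python
RANDOM_BASELINE = 0.279  # random baseline quoted in the article
HUMAN_SCORE = 0.547      # human score quoted in the article


def normalized(score, low=RANDOM_BASELINE, high=HUMAN_SCORE):
    """Rescale a raw score so that random maps to 0 and human maps to 1.

    Negative values mean worse than random; values above 1 would mean
    better than the human reference.
    """
    return (score - low) / (high - low)


# Hypothetical raw scores, chosen only to show the three regimes.
for label, raw in [("below random", 0.25), ("between", 0.40), ("human", HUMAN_SCORE)]:
    print(f"{label}: raw={raw:.3f} normalized={normalized(raw):.3f}")
```

On this reading, a model "clears random" once its normalized value turns positive, and the remaining distance to 1 is the gap to human performance.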

The GPT 5.5 example stands out. It received two photographs of the same bedroom taken from separate angles. One image showed a door opening into one neighboring space. The second image showed a door opening into a different neighboring space. The model did not dismiss the data as contradictory. It did not hallucinate extra walls. It concluded the bedroom functioned as a through-room that connects the two adjacent areas. This behavior indicates the model constructed and revised an internal three-dimensional representation based on incomplete visual information. That step crosses from pattern matching into actual spatial reasoning.
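The deduction described above can be pictured as merging per-photo door sightings into one room adjacency graph instead of treating them as conflicting. The following is a minimal sketch of that idea, not a description of how GPT 5.5 or the benchmark works internally. The room labels `space_a` and `space_b` and both function names are hypothetical, since the source does not name the neighboring spaces.

```python
from collections import defaultdict


def merge_observations(observations):
    """Union (room, neighbor) door sightings into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for room, neighbor in observations:
        adjacency[room].add(neighbor)
        adjacency[neighbor].add(room)
    return adjacency


def through_rooms(adjacency):
    """Return rooms that have doors to two or more distinct neighbors."""
    return sorted(room for room, nbrs in adjacency.items() if len(nbrs) >= 2)


# Hypothetical labels standing in for the two photos of the same bedroom.
photos = [
    ("bedroom", "space_a"),  # photo 1: door into one neighboring space
    ("bedroom", "space_b"),  # photo 2: door into a different neighboring space
]

adjacency = merge_observations(photos)
print(through_rooms(adjacency))  # ['bedroom']
```

The point of the union step is that two views showing different doors are complementary evidence about one room, so the bedroom ends up with two neighbors and is flagged as a connector between them.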

The original Blueprint-Bench demonstrated that surface fixes could not solve the problem. Extra prompting, agent scaffolding, and iterative refinement all failed because the underlying spatial understanding simply was not present. Models could not maintain consistency across views or infer connectivity. The appearance of this capability in a limited form after seven months therefore carries weight. It points to scaling and training advances beginning to fill a core gap rather than paper over it.

Gemini Robotics ER deserves specific attention. Its training targeted spatial and embodied reasoning. Every indication suggested it should dominate this evaluation. It scored below the base Gemini 3 Flash version instead. The same pattern appeared on Butter-Bench. Specialized pathways for physical world tasks have not yet produced advantages on abstract spatial reconstruction. Anyone building toward robotics should note this gap. Design choices that look obviously correct on paper do not always translate into benchmark gains.

I track evaluations of this type because 3D spatial intelligence forms a necessary foundation for robotics systems that operate in homes, warehouses, or unpredictable environments. Partial observations must translate into accurate layouts and navigation decisions. Robotics in turn represents one of the more concrete routes to AI systems that deliver transformative economic impact. When models begin to register success on tasks like this, that success supplies a concrete data point about development speed and the timeline on which certain safety considerations may become relevant.

The results also align with how models appear to build understanding of the physical world. Text encodes countless traces of spatial relations, object permanence, and causal structure even when those facts receive no explicit labels. Stories, instructions, and descriptions reflect how gravity, connectivity, and scale behave. Models that compress and predict that distribution extract the regularities. Extending the same process to visual inputs and consistent three-dimensional layouts represents a logical continuation of the same mechanism. The through-room deduction shows a model applying that extracted knowledge to resolve ambiguity across multiple photographs.

Benchmarks like Blueprint-Bench cut through marketing language. The task remains fixed. The scoring criteria stay objective. Direct comparison across time becomes possible without relying on vague statements about internal capability. The seven-month gap between total failure and initial logical deduction offers a cleaner read on current rates of change than isolated scores on simpler tests. The original evaluation proved that no amount of clever prompting could substitute for missing spatial representation. The fact that top models now succeed in limited cases suggests the underlying deficits are gradually being addressed through continued improvements in scale, data, and training methods.

This does not alter which model anyone should select for writing, coding, or day-to-day knowledge work. Those use cases rest on different strengths. Yet the directional signal matters for longer-range decisions. Applications involving simulation, virtual staging, or eventual physical interaction benefit from accurate spatial understanding. Builders who anticipate robotics or embodied AI pathways now have a clearer reference point for when relevant capabilities may cross practical thresholds. The gap with human performance stays large. Most models still fail outright. What exists today is movement from zero competence toward narrow success. That movement deserves documentation.

I have argued before that small gains on carefully designed benchmarks warrant attention. The difference between random layouts and correctly inferring room function from visual inconsistency appears modest in isolation. In context it indicates models acquiring the ability to reason about space in ways that mirror their growing competence with code or language. The notepad component adds another layer. By allowing the agent to accumulate observations across apartments, the benchmark tests whether patterns generalize rather than reset with each new environment. Early success here suggests the learning process can compound rather than fragment.

Practical expectations should stay anchored. These systems cannot yet direct robots through novel physical spaces with reliability. The capability remains narrow and brittle. Outputs that succeed on 50 apartments may collapse when furniture moves or lighting changes. The through-room example represents a genuine advance but sits far from the fluid, effortless spatial cognition people exercise constantly. Continued testing on follow-up versions of this benchmark will reveal whether the gains accelerate, compound, or encounter new plateaus. I plan to watch the next iteration with interest.

The full leaderboard and example outputs live at https://andonlabs.com/evals/blueprint-bench-2. The detailed traces show exactly where current systems succeed and where they still produce incoherent floor plans. This kind of transparent evaluation helps the field move past announcement-driven narratives toward concrete measurement. For those following AI development, the data supplies one more calibration point on the timeline toward systems that can interact meaningfully with the physical world. The progress qualifies as real. It also qualifies as early. Both facts belong in the same assessment.

One additional note on world models. Recent discussion sometimes frames video-style simulators as replacements for language models in reasoning. The Blueprint-Bench results reinforce a different picture. Language models already extract substantial physical regularities from text distributions. When those same models begin succeeding on visual spatial tasks, the boundary between language understanding and spatial understanding starts to blur. The primary reasoning substrate likely remains centered on language-model-style architectures, with additional modalities and simulators layered alongside for training and validation. This benchmark provides one data point supporting that view rather than a clean separation between text and world models.

Adam Holter

Founder of Ironwood AI. Writing about AI models, agents, and what's actually happening in the space.