
GPT-5.2 Day 2: Benchmark Kings or Regression Weirdos? The Big Model Smell vs. SimpleBench Fails

GPT-5.2 dropped December 11, 2025, and two days in, the community response is… confused. The model crushes certain benchmarks while face-planting on others. It has what I’d call ‘big model smell’ in some areas and ‘RL sloptimization vibes’ in others. This is a genuinely strange release.

The Benchmarks That Scream Scale

On paper, GPT-5.2 looks like a monster. The Pro variant hits 74.1% on GDPval, which measures performance on economic and knowledge work tasks. That’s above human expert level, running 11x faster at less than 1% of the cost. Claude sits at 59.6%, and the previous GPT-5 managed just 38.8%. This kind of performance—producing outputs that human experts prefer, which is hard to optimize for directly with RL—is a classic sign of massive scale.

[Figure: GDPval performance across models. GPT-5.2 Pro leads significantly, signaling major scale.]

The coding benchmarks reinforce this idea of scale. GPT-5.2 Thinking hits 55.6% on SWE-Bench Pro, beating Claude Opus 4.5 and the previous GPT-5.1. On SWE-Bench Verified, it reaches 80.0%. Science performance via GPQA Diamond shows Pro at 93.2%. And on AIME 2025 math problems, GPT-5.2 Thinking scored a perfect 100%.

It also excels at visual ‘vibe checks,’ like superior frontend/UI code generation for 3D and SVG, and it handles physics vibe tests like the ‘hexagon bouncing ball’ simulation. The new Chestnut image model can perfectly render code, UI, and technical text inside generated images, showing strong multimodal capabilities.

The Weird Regressions: SimpleBench and UI Fails

Here’s where the confusion starts. On SimpleBench, GPT-5.2 scores below Sonnet 3.7, a model that is nearly a year old. GPT-5.2 Pro barely beats the original GPT-5 on this benchmark. This is a massive regression in core capabilities for a flagship release.

SkateBench shows similar problems. Since Theo’s video on this, the community has been debating two possibilities: either the evaluation setup is flawed, or the model genuinely has weak spots in specific, simpler areas despite its overall size and power. The 40% price increase—API input went from $1.25 per million tokens on GPT-5.1 to $1.75 on GPT-5.2—suggests a larger, costlier model, making these regressions even more puzzling.

In my real-world testing, the pattern holds. The model is excellent at code analysis, big-picture brainstorming, and coming up with new ideas. But when it comes to implementation, especially UI, it is terrible. I’ve found myself switching back to Codex Max for complex brainstorming and then using Antigravity with Gemini 3.0 Pro for front-end work, sometimes bringing in Opus 4.5 for debugging. The model is powerful, but not a universal replacement.

The Hallucination Fix

Hallucination reduction is another critical improvement. With browsing enabled, GPT-5.2 produces incorrect claims at a 0.8% rate versus 1.5% for GPT-5.1, and major errors dropped from 8.8% to 5.8%. The new ‘reasoning token support’ appears to be working, making the model more reliable for serious academic and business use cases: domain errors are down to 0.3% for academic tasks and 0.7% for business tasks.

Long context handling has also improved significantly. On MRCRv2 with 8 needles at 256k tokens, GPT-5.2 hits 77% accuracy compared to GPT-5.1’s roughly 30%. This makes it genuinely useful for complex document analysis, something I’ve covered before in the context of agent platforms and long context windows.

The RL Sloptimization Hypothesis

My read is that this is a classic case of RL sloptimization. OpenAI trained extremely hard to dominate high-profile, complex benchmarks like GDPval, AIME, and GPQA, likely by focusing reinforcement learning heavily on these specific tasks. This optimization created blind spots, leading to unexpected regressions on simpler, less-glamorous benchmarks like SimpleBench and SkateBench, and poor performance in real-world UI implementation.

This is a common tradeoff in the current AI development cycle. Models don’t just get smarter across the board; they get better at what they are measured on. The model is a behemoth in reasoning and analysis, but the optimization process seems to have stripped away some of the general-purpose stability needed for tasks like reliable UI generation or simple benchmark performance.

The fact that GPT-5.2 is still terrible at humor just confirms that raw intelligence doesn’t automatically translate to every creative skill. They clearly didn’t optimize for that.

Bottom Line and Workaround Strategy

GPT-5.2 is not a universal upgrade, but it is a step-function improvement in high-stakes reasoning and agentic workflows. For most users, the $20/month Plus tier is still the best deal in tech, giving access to the best reasoning model currently available.

The strategy for power users must be model routing: Use GPT-5.2 Pro for analysis, agent tasks, and complex reasoning where accuracy matters. Switch to competitor models like Gemini 3.0 Pro or Opus 4.5 for specific implementation tasks, especially front-end code, until OpenAI addresses these strange regressions. This weird release confirms that the future of AI is not one model, but a sophisticated system of model switching based on task requirements.
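The routing strategy above can be sketched as a simple lookup table. To be clear, this is a hypothetical illustration: the task categories, the `route` function, and the model identifier strings are my own assumptions, not a real API—you’d substitute whatever names your provider or orchestration layer actually uses.

```python
# Illustrative sketch of task-based model routing (assumed names, not a real API).
# Maps a task category to the model the article recommends for it.

ROUTES = {
    "analysis": "gpt-5.2-pro",        # complex reasoning, accuracy-critical work
    "agent": "gpt-5.2-pro",           # agentic / tool-use workflows
    "frontend": "gemini-3.0-pro",     # UI and front-end implementation
    "debugging": "claude-opus-4.5",   # debugging support
}
DEFAULT_MODEL = "gpt-5.2-thinking"    # reasonable general-purpose fallback


def route(task_type: str) -> str:
    """Pick a model for a task category, falling back to a default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


if __name__ == "__main__":
    print(route("frontend"))   # gemini-3.0-pro
    print(route("summarize"))  # gpt-5.2-thinking (no special route defined)
```

In practice the interesting part is not the table but keeping it easy to change: when OpenAI patches these regressions, you flip one entry rather than rewriting prompts across your stack.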