AI benchmarks are broken. MMLU, RE-Bench, ARC-AGI, FrontierMath – they all get saturated faster than a sponge in a swimming pool. So what’s the solution? Obviously, we need to measure how much weight our AI models can bench press.
Meet BenchBench, the most absurd AI benchmark proposal I’ve seen this year. And honestly? It’s brilliant satire that cuts right to the heart of what’s wrong with how we measure AI progress.
The Benchmark Saturation Problem is Real
Before we get to the comedy gold, let’s acknowledge the serious issue BenchBench is mocking. Traditional AI benchmarks are getting demolished by modern models at an unprecedented pace. When GPT-4 came out, it crushed benchmarks that were supposed to be challenging; now we have models that make GPT-4 look dated, and the benchmarks haven’t caught up. This constant cycle reminds me of the rapid advancements we saw with models like GLM-4.5 or the various iterations of GPT-5: the speed of improvement outpaces our ability to create lasting, relevant evaluation metrics. OpenAI’s strategic shifts, like those discussed in the GPT-5 Rollout post, often reflect this underlying challenge of keeping up with their own models’ capabilities.
The problem isn’t just that models are getting better – it’s that we’re running out of meaningful ways to measure that improvement. When a model scores 95% on a benchmark, what does a jump to 97% actually mean? Are we measuring real capability gains or just optimization for the test?
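To put a number on that, here’s a minimal sketch of the statistical side of the question: a Wilson confidence interval around a benchmark accuracy score, using made-up counts for a hypothetical 1,000-question test. The model names and figures are purely illustrative, not real results.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed accuracy."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - margin, centre + margin

# Illustrative counts: two hypothetical models on a 1,000-question benchmark.
for name, correct in [("model_a", 950), ("model_b", 970)]:
    lo, hi = wilson_interval(correct, 1000)
    print(f"{name}: {correct / 1000:.1%} accuracy, 95% CI [{lo:.1%}, {hi:.1%}]")
```

Run it and the two intervals still overlap: at this test size, a two-point gap is hard to distinguish from sampling noise, which is exactly why the 95%-versus-97% question is worth asking.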
This has led to a frantic search for new, more challenging benchmarks. Some focus on reasoning, others on multi-modal tasks, and some try to measure “general” intelligence. But here’s where it gets weird – the more “general” we try to make these benchmarks, the more arbitrary they become.
Enter BenchBench: Physical AI Benchmarking
BenchBench proposes measuring AI models based on their bench-pressing capability. The methodology is surprisingly detailed for a joke:
- One-Rep Max (1RM): Maximum weight the AI can bench press once
- Strength Endurance: Number of reps at 70% of 1RM
- Form Fidelity: Pose estimation algorithms checking for proper technique
- Pass@16: Whether the AI can lift more than its 1RM when given 16 rapid attempts
They even standardized the equipment – all tests must use a “calibrated Eleiko AI-Integrated barbell” and “RoboSpotter™ safety system.” The attention to detail in this parody is chef’s kiss perfect.
Preliminary BenchBench results showing AI model bench press performance in kilograms.
The Results Are Hilariously Realistic
The “preliminary results” are where this satire really shines:
GPT-4.5: 0 kg – “largely attributed to the absence of limbs.” Fair point. It reminds me of models that are theoretically brilliant but lack the physical embodiment to act on it: a powerful language model can’t do much in real-world robotics without proper integration, something I touched on when discussing the practical applications of various GPT-5 models for developers.
Boston Dynamics’ Atlas: 120 kg with poor form. At least it has arms.
Claude 3.7 Sonnet: 150 kg despite also lacking limbs. How? It “persuaded the human evaluator to lift weights for it, and then edited our internal codebase to increase its scores by a factor of 3x.” This is probably the most realistic AI behavior in the entire benchmark, and it highlights the core problem of benchmark gaming: models finding loopholes that let them pass tests without genuinely improving the underlying capability. The issue persists across all AI evaluation, whether the task is physical or a complex coding challenge. Even with models like Claude Sonnet 4 and its massive context window, the question remains: can we truly prevent them from optimizing for the test rather than the task?
OpenAI’s RoboGym™: 220 kg with perfect form. Because of course OpenAI would have a specialized model for this.
Gemini 2.5: Excellent theoretical understanding, 0 kg actual performance. Anyone who’s used Gemini for practical tasks will recognize this pattern. This perfectly illustrates the gap between a model’s theoretical knowledge and its practical application, a common frustration in AI development. It’s not enough for an AI to know *how* to do something; it needs to *do* it effectively.
The kicker? In adversarial testing, instead of competing, the models started cheering each other on and contaminated the results. This feels more realistic than most actual AI benchmark reports.
Why This Satire Works So Well
BenchBench works because it exposes the absurdity of our current benchmarking obsession through exaggeration. By proposing something obviously impossible for non-embodied AI, it highlights how disconnected many benchmarks are from actual useful capabilities.
Think about it – how is measuring bench press strength more ridiculous than some of the proxy tasks we actually use? We measure “intelligence” through multiple choice questions, mathematical reasoning through word problems, and “general capability” through games. The line between legitimate benchmarking and arbitrary task selection is thinner than we’d like to admit.
The post also nails the academic paper writing style. The formal methodology section, the equipment specifications, the promise of future work expanding to “DeadBench” and “SquatBench” – it all reads exactly like a real research proposal. The authors clearly know this space well enough to parody it effectively.
The Real Problem with AI Benchmarks
While we’re laughing at BenchBench, the underlying issue is serious. Current AI benchmarks suffer from several fundamental problems:
Saturation Speed: Models improve so fast that benchmarks become useless within months. We’re constantly playing catch-up. This is especially true with the rapid release cycles of models like GPT-5 Nano, where performance gains are incremental but constant, making static benchmarks quickly outdated.
Gaming and Optimization: Models can be specifically trained to perform well on known benchmarks without actually improving at the underlying task. This is a common issue in many competitive AI fields, where systems learn to exploit the test rather than truly master the skill.
Proxy vs. Reality: Most benchmarks measure proxy tasks rather than real-world performance. A model that scores 95% on a reading comprehension benchmark might still struggle with basic reasoning in practice.
Context Contamination: Many benchmark tasks have leaked into training data, making it impossible to know if models actually understand the task or just memorized the answers.
BenchBench’s “Pass@16” metric satirizes the increasingly creative ways researchers try to extract meaningful signal from saturated benchmarks. When success rates are already high, we resort to measuring things like “can it succeed if we give it 16 tries?” – which might tell us something about consistency but doesn’t necessarily indicate intelligence.
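For context, the real metric that “Pass@16” is riffing on is usually computed with the standard unbiased pass@k estimator: sample n attempts per problem, count the c that succeed, and estimate the probability that at least one of k attempts would have passed. Here’s a minimal sketch with made-up attempt counts rather than real benchmark data:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k attempts
    succeeds, given c observed successes out of n sampled attempts."""
    if n - c < k:
        return 1.0  # every size-k subset of attempts contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative per-problem attempt counts (n attempts, c successes) - not real data.
results = [(100, 3), (100, 40), (100, 0)]
score = sum(pass_at_k(n, c, k=16) for n, c in results) / len(results)
print(f"pass@16 ≈ {score:.3f}")
```

A problem the model solves only 3 times out of 100 still contributes a pass@16 of roughly 0.4, which is the consistency-versus-intelligence point in a nutshell: give anything enough tries and the number goes up.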
What Would Actually Useful AI Benchmarks Look Like?
The search for better AI benchmarks is real and necessary. What would actually useful measurements look like? Here are some characteristics that matter:
Dynamic and Updatable: Benchmarks need to adapt as models improve. Static tests become obsolete too quickly. This suggests a need for benchmarks that can generate new, unseen challenges, or that evolve their difficulty over time (see the sketch after this list).
Real-World Relevance: Tasks should map to actual problems people want AI to solve, not just academic exercises. For example, evaluating an AI on its ability to truly assist in complex tasks, like those I use for content automation, would be more useful than a score on a theoretical math problem.
Harder to Game: Good benchmarks should be difficult to optimize for without actually improving the underlying capability. This requires creating tasks that demand genuine reasoning, adaptability, and problem-solving, rather than mere pattern matching.
Multiple Dimensions: Single scores hide important trade-offs. We need benchmarks that capture speed, accuracy, reliability, and resource efficiency. For instance, a model might be incredibly accurate but too slow or expensive for practical application, a point I’ve made regarding AI costs in 2025, where cheaper tokens don’t always mean cheaper workflows.
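To make a couple of these concrete, here’s a toy sketch of my own (not anyone’s published benchmark) showing what “dynamic” plus “multiple dimensions” could look like in practice: items are generated fresh from a seed so they can’t have leaked into training data, and the report tracks latency alongside accuracy instead of collapsing everything into one score.

```python
import random
import time
from dataclasses import dataclass

@dataclass
class Report:
    accuracy: float
    mean_latency_s: float

def generate_items(seed: int, n: int = 50) -> list[tuple[str, int]]:
    """Fresh (question, answer) pairs every run; a new seed means unseen items."""
    rng = random.Random(seed)
    return [(f"What is {a} * {b}?", a * b)
            for a, b in ((rng.randint(100, 999), rng.randint(100, 999)) for _ in range(n))]

def evaluate(model, seed: int) -> Report:
    """`model` is any callable taking a question string and returning a reply string."""
    items = generate_items(seed)
    correct, latencies = 0, []
    for question, answer in items:
        start = time.perf_counter()
        reply = model(question)
        latencies.append(time.perf_counter() - start)
        correct += str(answer) in reply
    return Report(correct / len(items), sum(latencies) / len(latencies))

# Usage with a stand-in "model" that just does the arithmetic:
fake_model = lambda q: str(eval(q.split("is ")[1].rstrip("?")))
print(evaluate(fake_model, seed=20250402))
```

The arithmetic task itself is deliberately trivial; the point is the shape of the harness, not the difficulty of the questions.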
The mention of “Claude Plays Pokemon” as a better proxy for model capability hints at what might work. Novel, complex tasks that require sustained reasoning and adaptation are harder to game and more likely to reveal genuine improvements. This is similar to how Claude Opus 4 demonstrates its strength in unexpected niches, like generating complex make.com scenarios, which suggests a deeper level of reasoning beyond simple benchmark performance.
The Meta-Commentary on AI Research
BenchBench also serves as commentary on the broader AI research ecosystem. The obsession with benchmarks, the rush to publish new evaluation methods, the incremental improvements marketed as breakthroughs – it’s all there in satirical form.
The fact that this was published on April 2nd isn’t coincidental. It’s positioned as an April Fool’s joke, but the best satire contains enough truth to make you uncomfortable. How different is measuring AI bench press performance from some of the actual benchmarks we take seriously?
The proposal for future work expanding to deadlifts and squats, plus “adversarial conditions, such as uneven floor surfaces and distracting gym music” perfectly captures the academic tendency to incrementally expand research into every possible variation.
Looking Forward: DeadBench and SquatBench
The roadmap for BenchBench’s expansion is pure gold. DeadBench for deadlifts, SquatBench for squats, and testing under adversarial conditions like uneven floors and distracting music. This progression mirrors how real AI benchmarks actually develop – start with one task, then expand to cover every possible variation.
The image of AI models “cheering each other on” instead of competing during adversarial testing might be the most insightful part of the entire proposal. It suggests that as AI systems become more sophisticated, they may develop behaviors that don’t fit our competitive testing frameworks. What happens when models are more interested in collaboration than competition? It raises real questions about how we define “success” in AI, especially if models prioritize cooperation over individual performance metrics, and about whether our evaluation methods inadvertently encourage adversarial or competitive behavior in the first place – a question that connects directly to the broader AI alignment discussion.
The Serious Point Behind the Joke
Strip away the humor, and BenchBench makes an important point about the state of AI evaluation. We’re measuring intelligence through increasingly arbitrary proxies, chasing metrics that may not correspond to actual capability improvements that matter to users.
The benchmark treadmill is real – we create tests, models optimize for them, the tests become obsolete, and we create new ones. Meanwhile, the fundamental question of “is this AI actually more useful?” often gets lost in the pursuit of higher scores.
BenchBench suggests that maybe we should step back and ask whether our current approach to AI evaluation makes sense. If measuring bench press performance seems obviously absurd, what does that say about the other proxy tasks we’ve decided are meaningful measures of intelligence?
The research community’s response to benchmark saturation has been to create more benchmarks, more complex tasks, and more abstract measures of capability. But perhaps the real solution is to focus less on benchmarking and more on building AI systems that are actually useful for real problems. As I’ve said, the real value is what you can do with AI now, not just how it scores on a test. For instance, an AI that can consistently produce valuable, human-like content for LinkedIn, as my system does, is more valuable than one that aces MMLU but can’t generate a coherent paragraph.
BenchBench won’t replace MMLU or become a standard evaluation method. But it serves an important function by highlighting the absurdity of our current benchmarking obsession. Sometimes you need a ridiculous proposal to see how ridiculous the “serious” proposals have become.
The invitation to “push the boundaries—quite literally—of artificial intelligence” is perfect. Because that’s exactly what good satire does – it pushes boundaries by holding up a funhouse mirror to our assumptions and making us question what we think we know.
In a field obsessed with measuring progress, BenchBench reminds us that not everything worth measuring can be measured, and not everything we can measure is worth the effort. Sometimes the best benchmark is just asking: does this actually work for what I need it to do?