The SenseMath Paper Tested Budget Models and Called It a Verdict on AI

A paper out of the University of Notre Dame called SenseMath is making the rounds with the claim that LLMs don’t have number sense. The original thread pushing it hit 237K views. The conclusion being drawn by people reposting it is that AI can’t do math, reasoning models are marketing, and you shouldn’t trust LLMs with numerical work.

The paper itself is measuring something real. It built a controlled benchmark of 4,800 items across eight shortcut categories, tested whether models apply numerical shortcuts when appropriate, avoid them when inappropriate, and can generate valid shortcut-bearing problems from scratch. Those are meaningful distinctions. The findings — that models adopt shortcuts when explicitly prompted but rarely do so unprompted under standard chain-of-thought, and that they over-generalize shortcuts to problems where they don’t apply — are worth knowing about.
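To make the distinction concrete, here is a rough sketch of the kind of paired items a benchmark like this needs: one problem where the shortcut is valid, and a surface-similar one where applying it would be an over-generalization. The category name, field names, and example numbers below are my own illustrations, not taken from the released dataset.

```python
# Illustrative sketch of the paired-item structure such a benchmark needs.
# Category names, fields, and numbers are placeholders, not the SenseMath data.
items = [
    {
        "category": "digit_sum_divisibility",
        "shortcut_applies": True,
        # Digit sum 5+3+1+2+0+6+1 = 18 is divisible by 9, so the number is too.
        "problem": "Is 5312061 divisible by 9?",
        "answer": True,
    },
    {
        "category": "digit_sum_divisibility",
        "shortcut_applies": False,
        # Same surface form, but the digit-sum rule says nothing about 7;
        # a model that applies it anyway is over-generalizing the shortcut.
        "problem": "Is 5312061 divisible by 7?",
        "answer": False,
    },
]
```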

The problem is the model selection.

SenseMath tested only budget and small open-weight models

SenseMath tested exactly five models: GPT-4o-mini, GPT-4.1-mini, Qwen3-8B, Qwen3-30B, and Llama-3.1-8B. Every single one is a mini, budget, or sub-10B open-source model. Qwen3-30B is the largest, and it’s still a mid-tier open-weight model nowhere near frontier. The paper’s abstract frames the evaluation range as “GPT-4o-mini to Llama-3.1-8B” — which is technically a range, but it’s a range between the low end and the low end.

None of the models that people actually use for serious numerical and analytical work were tested. Not GPT-5.2. Not Claude Opus 4.5. Not Gemini 3 Pro. Not DeepSeek V3.2. The paper simply didn’t test them. That gap matters enormously for how the findings should be interpreted. The paper’s conclusion — that models exhibit “procedural shortcut fluency without the structural understanding of when and why shortcuts work” — may well be true for GPT-4o-mini and Llama-3.1-8B. It says nothing about whether that pattern holds for frontier-tier models, which have substantially different training regimes and capability profiles.

One commenter in the thread put it well: this is like taking a bicycle to a drag race and publishing a paper saying vehicles are slow. Another compared it to testing 1950s cars and concluding cars can’t hit 60 mph. Both comparisons land because they capture the actual structure of the problem. You can’t draw conclusions about a class of objects by testing only the weakest members of that class and treating the results as universal.

The counterargument that comes up in situations like this is that papers take months or years to reach publication, so outdated models are inevitable. That’s a real constraint. Academic publication timelines are genuinely long, and compute budgets are genuinely limited. But that constraint should affect how confident you are in the findings, not how broadly you circulate them. If the paper is constrained to budget models, the headline should be “budget LLMs show weak number sense” — not “LLMs don’t have number sense.”

The social media framing made that leap without hesitation. The thread that drove most of the traffic stated flatly that the paper “just proved LLMs don’t have number sense at all” and that “reasoning models are mostly marketing.” Those are claims about frontier systems that the paper provides zero evidence for. The paper didn’t test reasoning models. It tested the budget tier and published a result about the budget tier.

The actual findings, taken at face value within their proper scope, are useful. Models in this cheaper, smaller tier do seem to exhibit the shortcut over-generalization pattern the paper describes, and that’s relevant if you’re deploying those models. The recommendation to use actual calculators and code execution for arithmetic rather than raw LLM outputs is sound, and it was sound long before this paper. Don’t let the model do arithmetic on its own for anything that matters; use tools. That’s just correct practice.
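As a minimal sketch of what “use tools” means in practice: prompt the model to return an arithmetic expression rather than a final number, then compute the result deterministically on your side. The evaluator and the example expression below are illustrative, not anything from the paper.

```python
import ast
import operator

# Instead of trusting the model's stated answer, have it emit a plain
# arithmetic expression and evaluate that expression deterministically here.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def eval_arithmetic(expr: str) -> float:
    """Safely evaluate a plain arithmetic expression without using eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {ast.dump(node)}")
    return _eval(ast.parse(expr, mode="eval"))

# Example: the model is asked to return only the expression, e.g. "1048 * 97",
# and the number you actually use comes from this function, not the model.
print(eval_arithmetic("1048 * 97"))  # 101656
```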

But there’s a meaningful difference between “here’s a limitation of budget models that practitioners should know about” and “AI can’t do math and reasoning is marketing.” The SenseMath paper is closer to the first. The thread pushing it made it sound like the second, and 237K people saw that framing before most of them ever looked at the methodology.

This is a pattern worth noticing in how AI research spreads. Papers that test weak models and reach negative conclusions about AI capabilities tend to spread faster than papers that test strong models and find more nuanced results. Partly that’s because negative findings feel counterintuitive given the current climate of AI enthusiasm, so they feel revelatory. But the strength of a finding is not determined by how counterintuitive it is. It’s determined by what was actually tested. This dynamic is similar to what happens with benchmark releases — see the discussion around ARC-AGI-3, where the framing of results outpaced what the methodology actually supported.

The benchmark itself is publicly available at the GitHub repo linked in the paper. Anyone can run it against GPT-5.2, Claude Opus 4.5, or Gemini 3 Pro. That reproduction hasn’t happened at scale yet. Until it does, the paper’s findings should be understood as applying to the specific five models tested — not to LLMs as a category. The paper is a reasonable contribution to the literature on numerical reasoning in smaller models. The claim that it proved something fundamental about LLMs broadly is not supported by the methodology, and that distinction matters every time someone cites this paper as evidence that AI can’t do math.
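For anyone who wants to attempt that reproduction, the loop is not complicated. This sketch assumes a JSONL file of benchmark items and uses the OpenAI Python client; the file name, the field names, and the "gpt-5.2" model string are placeholders rather than details taken from the repo.

```python
# Rough sketch of a frontier-model reproduction run. File format, field names,
# and the model identifier are assumptions, not taken from the SenseMath repo.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("sensemath_items.jsonl") as f:  # placeholder filename
    items = [json.loads(line) for line in f]

results = []
for item in items:
    response = client.chat.completions.create(
        model="gpt-5.2",  # placeholder model identifier
        messages=[{"role": "user", "content": item["problem"]}],
    )
    results.append({
        "problem": item["problem"],
        "model_answer": response.choices[0].message.content,
        "expected": item["answer"],
    })
```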
