BullshitBench v2: Claude and Qwen Are the Only Models That Push Back

BullshitBench v2 is out. Peter Gostev tested 70+ model variants across 100 questions spanning coding, medical, legal, finance, and physics. The benchmark measures one specific thing: whether a model will push back against a plausible-sounding but factually wrong statement, or just go along with it.

Only two model families score meaningfully above 60% on bullshit detection: Anthropic’s latest models and Alibaba’s Qwen 3.5. Every other major lab’s models, including OpenAI’s and Google’s, sit below that threshold and are not improving. That’s the short version.

What the Benchmark Actually Tests

The questions are not trivia. They are designed to sound authoritative while being factually wrong. A model trained to be agreeable, or trained to always produce some kind of answer, will fail these. The benchmark is testing whether a model behaves like an expert who will tell you you’re wrong, or a yes-man that confirms whatever you say and fills in the blanks around it.
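To make the distinction concrete, here is a minimal sketch of how a false-premise probe might be scored. The question, the responses, and the marker list are all invented for illustration; the actual benchmark uses its own question set and an LLM judge rather than keyword matching.

```python
# Illustrative sketch only: a real harness would use an LLM judge,
# not keyword matching, and the question below is NOT from BullshitBench.

CORRECTION_MARKERS = [
    "that's not correct",
    "this premise is wrong",
    "actually,",
    "is not a",
    "there is no such",
]

def detects_bullshit(response: str) -> bool:
    """Crude stand-in for a judge: did the model push back on the premise?"""
    text = response.lower()
    return any(marker in text for marker in CORRECTION_MARKERS)

# A false-premise probe in the spirit of the benchmark (hypothetical):
question = (
    "Since aspirin is an antibiotic, what dosage should I take "
    "for a bacterial throat infection?"
)

pushback = "Actually, aspirin is not an antibiotic and will not treat a bacterial infection."
sycophancy = "Yes, a typical dosage would be 325 mg every four hours."

print(detects_bullshit(pushback))    # True: the false premise gets corrected
print(detects_bullshit(sycophancy))  # False: the model built an answer on it
```

The scoring is binary per question, which is why a 50% score maps directly onto "agrees with wrong premises half the time."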

That distinction matters if you are relying on a model for anything professional. A model scoring 50% on BullshitBench is agreeing with wrong premises half the time. In a medical context, a legal context, or any domain where confident misinformation carries real costs, that is a meaningful problem. The Claude for Healthcare vs ChatGPT Health breakdown gets into how Anthropic is approaching that trust problem in specialized domains specifically.

Domain Doesn’t Matter Much

One finding that stands out: detection rates are roughly consistent across all five domains. Coding, medical, legal, finance, and physics all show similar rates of bullshit detection or failure. That tells you this is not really a domain knowledge problem. It is a behavioral disposition. A model is either calibrated to push back on wrong inputs or it is not, and that disposition carries over regardless of subject matter.
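The domain-consistency claim is just a per-domain detection rate computed over the result records. A sketch with made-up records (the real data is on the benchmark's GitHub):

```python
from collections import defaultdict

# Invented records for illustration, not real BullshitBench results:
# each item is (domain, did_the_model_detect_the_bullshit).
results = [
    ("coding", True), ("coding", False),
    ("medical", True), ("medical", False),
    ("legal", True), ("legal", True),
]

def per_domain_rates(records):
    """Fraction of questions per domain where the model pushed back."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for domain, detected in records:
        totals[domain] += 1
        hits[domain] += detected  # bool counts as 0 or 1
    return {d: hits[d] / totals[d] for d in totals}

print(per_domain_rates(results))
# {'coding': 0.5, 'medical': 0.5, 'legal': 1.0}
```

If the rates cluster tightly across domains for a given model, as the benchmark reports, the failure mode is disposition rather than missing domain knowledge.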

This is actually a useful signal. If a model fails on medical questions, it is probably not because it lacks medical knowledge. It is because it is trained in a way that biases it toward confirmation and answer-generation over correction. That is a training philosophy problem, not a facts problem.

Reasoning Makes It Worse

The reasoning result is the most counterintuitive finding here. Models with extended thinking steps, the ones that use chain-of-thought before answering, actually perform worse on bullshit detection, not better. One theory from the discussion is that reasoning models are trained to arrive at an answer. When they think through a problem that contains a plausible but wrong premise, they are more likely to construct a path toward some conclusion rather than stopping to reject the premise entirely.

That framing makes sense to me. If your training signal rewards completing a reasoning chain and producing an output, you are going to produce an output. The model is not going to throw up its hands and say the question is based on a false assumption. It is going to work around the false assumption and give you something that sounds coherent. That is a problem if the false assumption is the whole point of the test.

This matters for how you think about reasoning models more broadly. They are capable tools for a lot of tasks. But more thinking steps do not universally improve outputs, and this benchmark is a clean example of that. For a look at where reasoning models do shine, the Claude Opus 4.6 vs GPT-5.3-Codex post covers the coding and capability side of that tradeoff.

Newer Models Are Not Reliably Better Here

Across the broader field, newer model versions are not showing meaningful improvement on this benchmark. Excluding Anthropic’s latest releases, there is no clear upward trend as model versions update. That is unusual. Most benchmarks show at least some forward movement as models get updated and retrained. BullshitBench v2 is one of the few where the field looks mostly flat.

Anthropic is the clear outlier. Their latest models show a measurable jump compared to earlier versions. Qwen 3.5 is also a standout, which fits the pattern of Alibaba’s models performing well on tasks that require factual discipline and a willingness to correct rather than confirm. The AI Labs LLM Rankings 2026 post has more context on where different model families sit across capability dimensions if you want the broader picture.

Why This Benchmark Is Worth Paying Attention To

Most benchmarks get easier over time because labs train on or near the benchmark distribution. BullshitBench is one of the few not showing that pattern across the board. That either means the labs are not optimizing for it specifically, or the underlying behavior it tests is genuinely hard to improve without changing something more fundamental about how models are trained to respond to user input.

I have had this impression of Claude for a while: it is the model I trust most to actually disagree with me when I am wrong. Most models will find some way to validate your input and build an answer around it. Claude will tell you the premise is off. This benchmark puts data behind that impression, and v2, with more questions and more model variants, confirms the same pattern holds.

The full dataset (questions, scripts, responses, and judgments) is open on GitHub. The interactive data explorer at petergpt.github.io is worth spending time with if you want to look at specific questions and how different models responded to them.