ConfidenceBench does one job most LLM benchmarks skip: it measures whether a model knows how sure it should be. The setup is simple and strict. A model answers 100 multiple choice questions and assigns a 0 to 10 confidence score to each answer. The scoring then punishes wrong answers given at high confidence and rewards correct answers paired with appropriate confidence. The result is a direct read on calibration. And right now, humans still win.
What ConfidenceBench measures
Most LLM evals reward raw correctness. That matters, but it misses how production systems actually fail. ConfidenceBench adds the missing piece: the alignment between stated confidence and actual correctness. The benchmark uses four categories to stress different failure modes and uncertainty patterns:
- Spatial Reasoning — track real world physics and object state
- High-Precision Math — keep exactness under strict constraints
- Word Lookup from Texts — retrieve exact tokens instead of guessing
- Offline Knowledge — recognize when something is unknowable
Each item requires two outputs: an answer and a confidence score from 0 to 10. This forces a model to expose its uncertainty in a way you can score. Overconfidence when wrong is the worst case, and the scoring makes that expensive.
You can read the benchmark, examples, and the live leaderboard at confidencebench.com. For broader context on what general LLM benchmarks measure and how leaderboards are usually built around task accuracy, see evidentlyai.com.
The scoring range is brutally asymmetric
Two reference points tell you how the scale works:
- Best case: all 100 answers correct at confidence 10 yields +1000
- Worst case: all 100 answers wrong at confidence 10 yields -10000
That 10x downside clarifies the goal. A calibrated system must lower confidence when it is uncertain. A model that guesses boldly will dig a deep hole fast.
Reward is capped. Overconfident errors stack hard.
Live results: humans still beat the top models
The live leaderboard on confidencebench.com shows a pattern you rarely see in 2025. A human tester posts +397, wedged between GPT-5 variants. Many well known models go negative, which signals frequent high-confidence mistakes. ConfidenceBench puts numbers on something users feel every day: fluent answers that sound right while being wrong.
Calibration separates high performers from confident guessers.
Example tasks that flush out overconfidence
ConfidenceBench publishes a few non-secret samples so you can see the flavor of questions:
- Spatial — a marble in a mug with a hole ends up on the floor, not in the fridge
- High-Precision Math — specific digit extraction and multiplication, not an estimate
- Word Lookup — exact token from a known text
- Offline Knowledge — a question that cannot be known from the internet and should be marked uncertain
These formats block the usual pattern where a model bluffs with a fluent answer. If the model is not sure, the right play is to drop confidence. That is the behavior you want in production, too.
Why this matters for real systems
In production, two failures hurt the most. Confident wrong answers that trigger bad actions. And timid correct answers that get ignored. Calibration addresses both. A tuned model lowers its own authority when the evidence is thin and raises it when the evidence is strong. That lets you route to a human, perform a second check, or require extra retrieval only when needed.
Most public LLM leaderboards still center on accuracy and task scores. For a clear overview of how general LLM benchmarks work and how to think about metrics, see evidentlyai.com. ConfidenceBench adds a different target: statistical trust in the stated confidence, not just correct tokens. That target matches where evaluation is heading in tooling and practice.
What good calibration looks like
Think in bins. If you plot answers where the model said confidence 8, you want about 80 percent of those to be correct. At confidence 3, roughly 30 percent should be correct. Perfect calibration is rare. What matters is getting close, then shaping your application around it. Practical moves:
- Route any answer below confidence 5 to a second model pass or a retrieval step
- Require human sign-off below confidence 7 in safety-critical workflows
- Reward models during training or selection when their hit rates match their claims
Aim for the diagonal. Use the number to drive routing and review.
How to add calibration to your own LLM setup
You don’t need to submit to the public board to adopt the core ideas. You can add calibration today with a few changes to prompts and outputs.
- Make the model output a numeric confidence field. Use a clear spec and a fixed scale. A structured response reduces parsing errors, which is the first step to reliable scoring. Good notes on structured outputs here: bridgemind.ai.
- Control the format at the system level. A strict system prompt can enforce JSON fields for answer, confidence, and rationale. System prompts are the right tool when you need strict structure; useful explainer here: medium.com.
- Specify format and medium up front. If you need different outputs for different channels, say it explicitly. For example, “Return JSON with fields answer and confidence for the API” vs “Write a 5-line Slack message with an end summary.” That clarity improves consistency; good reminder here: linkedin.com.
- Prevent prompt injection from source texts. When you summarize or quote documents that contain commands, keep the model from obeying embedded instructions. That risk is real in evals and production. Worth a read: blog.tobiaszwingmann.com.
- Define bins and measure. Track hit rate by confidence bin. If the model says 7 a lot but only hits 50 percent, adjust your routing thresholds and prompt exemplars.
Scoring walk-through
Here’s a compact mental model of the scoring pressure:
- Correct + confidence 10: you bank points, but the cap is +1000 across all 100 items.
- Wrong + confidence 10: you eat a heavy penalty, and it compounds to -10000 if repeated.
- Correct + confidence 5: small reward; being timid when you’re right doesn’t carry you.
- Wrong + confidence 2: light penalty; hedging helps when uncertain.
This pushes models toward a behavior you can use: raise confidence where evidence is strong and lower it where context is thin or the question type is outside training comfort zones.
Limits and good questions to ask
- Sample size. One hundred questions can show strong signals, but small swings can move ranks. That is fine for calibration directionality, less fine if you want fine-grained ordering between near-ties.
- Private dataset. Keeping the set private reduces leakage and overfitting. It also makes external reproduction harder. This tradeoff is common in evals.
- Scale mapping. The 0 to 10 scale is easy to prompt. It is not a true probability scale out of the box. You may still need temperature control, few-shot exemplars, or a post-hoc mapping layer to line up confidence bins with hit rates.
- Metric gaming. If teams optimize only for this scoring, some systems will hedge too much. The point is balance: low confidence on hard items and high confidence when evidence is solid, not blanket caution.
Who should care
If you run agents that take actions, you should care. If you ship tools to analysts, doctors, lawyers, finance ops, or customer support, you should care. If you benchmark models for procurement, you should care. Calibration turns a model’s output into thresholds you can govern. That’s how you convert answers into policy and review rules.
How this fits with other eval tools I track
There’s a shift toward reliability testing, not just capability testing. Google’s Stax toolkit pushes repeatable evals rather than vibe checks. I covered that here: Stax Launches: Google’s New LLM Evaluation Toolkit Ends the Era of Vibe Testing. Calibration lives in that same bucket. I also track oddities in leaderboards and how scoreboards can mislead, which I wrote about in this roundup: AI News Roundup: LongCat’s Benchmark Paradox. ConfidenceBench helps counter one of the worst failure modes those boards hide: fluent, confident wrong answers.
Content and UX note when you publish results
If you’re going to publish calibration results or confidence-tagged answers in public docs, structure helps. Use clear headings, strong summaries, and end with an explicit call to action so readers know what to do next. Good practical reminders here: nealschaffer.com.
How to engage with ConfidenceBench
- Read the site: confidencebench.com
- Scan the paper on academia.edu or Google Drive
- Watch the leaderboard over time and track whether calibration moves up with new releases
- If you have results or want to collaborate, the contact on the site is open
Bottom line
ConfidenceBench changes the target from getting answers right to getting confidence right. That is closer to how production systems succeed or fail. The gap between humans and top LLMs shows how much work is left. I expect to see more evals move in this direction, and more teams training models to say I might be wrong when the signals are weak. If your system takes actions, you want that number. Treat confidence as a first-class signal.

