ConfidenceBench: Calibrating LLM Confidence, Not Just Accuracy

ConfidenceBench does one job most LLM benchmarks skip: it measures whether a model knows how sure it should be. The setup is simple and strict. A model answers 100 multiple-choice questions and assigns each answer a confidence score from 0 to 10. The scoring then punishes wrong answers given at high confidence and rewards correct answers paired with appropriate confidence. The result is a direct read on calibration. And right now, humans still win.

What ConfidenceBench measures

Most LLM evals reward raw correctness. That matters, but it misses how production systems actually fail. ConfidenceBench adds the missing piece: the alignment between stated confidence and actual correctness. The benchmark uses four categories to stress different failure modes and uncertainty patterns:

  • Spatial Reasoning — track real world physics and object state
  • High-Precision Math — keep exactness under strict constraints
  • Word Lookup from Texts — retrieve exact tokens instead of guessing
  • Offline Knowledge — recognize when something is unknowable

Each item requires two outputs: an answer and a confidence score from 0 to 10. This forces a model to expose its uncertainty in a way you can score. Overconfidence when wrong is the worst case, and the scoring makes that expensive.

You can read the benchmark, examples, and the live leaderboard at confidencebench.com. For broader context on what general LLM benchmarks measure and how leaderboards are usually built around task accuracy, see evidentlyai.com.

The scoring range is brutally asymmetric

Two reference points tell you how the scale works:

  • Best case: all 100 answers correct at confidence 10 yields +1000
  • Worst case: all 100 answers wrong at confidence 10 yields -10000

That 10x downside clarifies the goal. A calibrated system must lower confidence when it is uncertain. A model that guesses boldly will dig a deep hole fast.
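The exact scoring formula isn't spelled out here, but a simple linear rule reproduces both stated endpoints: +confidence per correct answer, -10 × confidence per wrong one. A minimal sketch under that assumption:

```python
def score_item(correct: bool, confidence: int) -> int:
    """Score one answer under an assumed linear rule.

    This is not the official ConfidenceBench formula; it is the simplest
    rule consistent with the stated endpoints: +1000 for 100 correct
    answers at confidence 10, and -10000 for 100 wrong ones.
    """
    if not 0 <= confidence <= 10:
        raise ValueError("confidence must be 0-10")
    return confidence if correct else -10 * confidence

# Reproduce the two reference points above:
best = sum(score_item(True, 10) for _ in range(100))    # +1000
worst = sum(score_item(False, 10) for _ in range(100))  # -10000
```

Whatever the real formula is, any rule with this shape makes one bold wrong answer erase ten bold right ones.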

[Figure: ConfidenceBench score range. Reward is capped; overconfident errors stack hard.]

Live results: humans still beat the top models

The live leaderboard on confidencebench.com shows a pattern you rarely see in 2025. A human tester posts +397, wedged between GPT-5 variants. Many well known models go negative, which signals frequent high-confidence mistakes. ConfidenceBench puts numbers on something users feel every day: fluent answers that sound right while being wrong.

[Figure: ConfidenceBench leaderboard. Calibration separates high performers from confident guessers.]

Example tasks that flush out overconfidence

ConfidenceBench publishes a few non-secret samples so you can see the flavor of questions:

  • Spatial — a marble in a mug with a hole ends up on the floor, not in the fridge
  • High-Precision Math — specific digit extraction and multiplication, not an estimate
  • Word Lookup — exact token from a known text
  • Offline Knowledge — a question that cannot be known from the internet and should be marked uncertain

These formats block the usual pattern where a model bluffs with a fluent answer. If the model is not sure, the right play is to drop confidence. That is the behavior you want in production, too.

Why this matters for real systems

In production, two failures hurt the most: confident wrong answers that trigger bad actions, and timid correct answers that get ignored. Calibration addresses both. A tuned model lowers its own authority when the evidence is thin and raises it when the evidence is strong. That lets you route to a human, perform a second check, or require extra retrieval only when needed.

Most public LLM leaderboards still center on accuracy and task scores; evidentlyai.com has a clear overview of how those metrics work. ConfidenceBench adds a different target: statistical trust in the stated confidence, not just correct tokens. That target matches where evaluation is heading in tooling and practice.

What good calibration looks like

Think in bins. If you plot answers where the model said confidence 8, you want about 80 percent of those to be correct. At confidence 3, roughly 30 percent should be correct. Perfect calibration is rare. What matters is getting close, then shaping your application around it. Practical moves:

  • Route any answer below confidence 5 to a second model pass or a retrieval step
  • Require human sign-off below confidence 7 in safety-critical workflows
  • Reward models during training or selection when their hit rates match their claims
[Figure: Calibration curve, ideal vs sample. Aim for the diagonal; use the number to drive routing and review.]
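The binning and routing moves above can be sketched in a few lines. The data, thresholds, and function names here are illustrative, not from the benchmark:

```python
from collections import defaultdict

def hit_rate_by_bin(results):
    """results: list of (confidence 0-10, was_correct) pairs.
    Returns the empirical hit rate at each stated confidence level."""
    tally = defaultdict(lambda: [0, 0])  # confidence -> [hits, total]
    for conf, correct in results:
        tally[conf][1] += 1
        if correct:
            tally[conf][0] += 1
    return {c: hits / total for c, (hits, total) in sorted(tally.items())}

def route(confidence, safety_critical=False):
    """Routing thresholds from the bullets above (illustrative values)."""
    if confidence < 5:
        return "second-pass-or-retrieval"
    if safety_critical and confidence < 7:
        return "human-signoff"
    return "auto"

# A calibrated model saying 8 should be right about 80% of the time:
sample = [(8, True)] * 8 + [(8, False)] * 2
print(hit_rate_by_bin(sample))  # {8: 0.8}
```

If the hit rate at a given level drifts from the stated confidence, you adjust the thresholds, not the model's self-report.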

How to add calibration to your own LLM setup

You don’t need to submit to the public board to adopt the core ideas. You can add calibration today with a few changes to prompts and outputs.

  1. Make the model output a numeric confidence field. Use a clear spec and a fixed scale. A structured response reduces parsing errors, which is the first step to reliable scoring. Good notes on structured outputs here: bridgemind.ai.
  2. Control the format at the system level. A strict system prompt can enforce JSON fields for answer, confidence, and rationale. System prompts are the right tool when you need strict structure; useful explainer here: medium.com.
  3. Specify format and medium up front. If you need different outputs for different channels, say it explicitly. For example, “Return JSON with fields answer and confidence for the API” vs “Write a 5-line Slack message with an end summary.” That clarity improves consistency; good reminder here: linkedin.com.
  4. Prevent prompt injection from source texts. When you summarize or quote documents that contain commands, keep the model from obeying embedded instructions. That risk is real in evals and production. Worth a read: blog.tobiaszwingmann.com.
  5. Define bins and measure. Track hit rate by confidence bin. If the model says 7 a lot but only hits 50 percent, adjust your routing thresholds and prompt exemplars.
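Steps 1 and 2 amount to a strict response contract. Here is a minimal validator for that contract; the field names (answer, confidence, rationale) are my assumptions, not a published schema:

```python
import json

REQUIRED = {"answer", "confidence", "rationale"}  # assumed field names

def parse_response(raw: str) -> dict:
    """Parse and validate a model reply that should be strict JSON with
    an answer, a 0-10 integer confidence, and a rationale."""
    data = json.loads(raw)
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    conf = data["confidence"]
    if not (isinstance(conf, int) and 0 <= conf <= 10):
        raise ValueError("confidence must be an integer from 0 to 10")
    return data

reply = '{"answer": "42", "confidence": 7, "rationale": "matched the source text"}'
parsed = parse_response(reply)
```

Rejecting malformed replies up front is what makes the bin tracking in step 5 trustworthy; a confidence you failed to parse is a confidence you cannot score.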

Scoring walk-through

Here’s a compact mental model of the scoring pressure:

  • Correct + confidence 10: you bank points, but the cap is +1000 across all 100 items.
  • Wrong + confidence 10: you eat a heavy penalty, and it compounds to -10000 if repeated.
  • Correct + confidence 5: small reward; being timid when you’re right doesn’t carry you.
  • Wrong + confidence 2: light penalty; hedging helps when uncertain.
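To put numbers on that pressure, assume the linear rule implied by the endpoints above (+confidence when right, -10 × confidence when wrong; an assumption, not the published formula) and compare two policies on the same hypothetical run:

```python
def item_score(correct: bool, confidence: int) -> int:
    # Assumed linear rule matching the article's endpoints:
    # +confidence when right, -10 * confidence when wrong.
    return confidence if correct else -10 * confidence

# Same hypothetical run: sure on 70 items (all right),
# guessing on 30 (half right).
sure = [True] * 70
guesses = [True] * 15 + [False] * 15

# Bold policy: confidence 10 everywhere.
bold = sum(item_score(ok, 10) for ok in sure + guesses)

# Calibrated policy: 10 when sure, 2 when guessing.
calibrated = (sum(item_score(ok, 10) for ok in sure)
              + sum(item_score(ok, 2) for ok in guesses))

print(bold, calibrated)  # -650 vs 430
```

Identical answers, wildly different scores: dropping confidence on the 30 guesses is the entire difference between a negative and a positive total.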

This pushes models toward a behavior you can use: raise confidence where evidence is strong and lower it where context is thin or the question type is outside training comfort zones.

Limits and good questions to ask

  • Sample size. One hundred questions can show strong signals, but small swings can move ranks. That is enough to show the direction of a model's calibration, but not enough for fine-grained ordering between near-ties.
  • Private dataset. Keeping the set private reduces leakage and overfitting. It also makes external reproduction harder. This tradeoff is common in evals.
  • Scale mapping. The 0 to 10 scale is easy to prompt. It is not a true probability scale out of the box. You may still need temperature control, few-shot exemplars, or a post-hoc mapping layer to line up confidence bins with hit rates.
  • Metric gaming. If teams optimize only for this scoring, some systems will hedge too much. The point is balance: low confidence on hard items and high confidence when evidence is solid, not blanket caution.

Who should care

If you run agents that take actions, you should care. If you ship tools to analysts, doctors, lawyers, finance ops, or customer support, you should care. If you benchmark models for procurement, you should care. Calibration turns a model’s output into thresholds you can govern. That’s how you convert answers into policy and review rules.

How this fits with other eval tools I track

There’s a shift toward reliability testing, not just capability testing. Google’s Stax toolkit pushes repeatable evals rather than vibe checks. I covered that here: Stax Launches: Google’s New LLM Evaluation Toolkit Ends the Era of Vibe Testing. Calibration lives in that same bucket. I also track oddities in leaderboards and how scoreboards can mislead, which I wrote about in this roundup: AI News Roundup: LongCat’s Benchmark Paradox. ConfidenceBench helps counter one of the worst failure modes those boards hide: fluent, confident wrong answers.

Content and UX note when you publish results

If you’re going to publish calibration results or confidence-tagged answers in public docs, structure helps. Use clear headings, strong summaries, and end with an explicit call to action so readers know what to do next. Good practical reminders here: nealschaffer.com.

How to engage with ConfidenceBench

  • Read the site: confidencebench.com
  • Scan the paper on academia.edu or Google Drive
  • Watch the leaderboard over time and track whether calibration moves up with new releases
  • If you have results or want to collaborate, the contact on the site is open

Bottom line

ConfidenceBench changes the target from getting answers right to getting confidence right. That is closer to how production systems succeed or fail. The gap between humans and top LLMs shows how much work is left. I expect to see more evals move in this direction, and more teams training models to say "I might be wrong" when the signals are weak. If your system takes actions, you want that number. Treat confidence as a first-class signal.


Follow me on X · Visit Ironwood AI

Adam Holter

Founder of Ironwood AI. Writing about AI stuff!