WildHallucinations: A Fresh Benchmark for LLM Factuality

The WildHallucinations benchmark (https://www.arxiv.org/pdf/2407.17468) is a new tool for assessing how well large language models (LLMs) handle factual information. It's designed to test these models on real-world queries, including topics that might not have extensive online documentation. It has also secured its place as the third-coolest name in AI benchmarking, after #1 HellaSwag and #2 Oobabooga.

Here’s what makes WildHallucinations stand out:

  • It uses entities mined from real user-chatbot conversations (the WildChat dataset), not artificial test cases.
  • Responses are fact-checked against a curated knowledge base built from web search results (roughly the pipeline sketched after this list).
  • Half the test entities don’t have Wikipedia pages, pushing models beyond common knowledge.
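
To make that checking loop concrete, here's a minimal sketch in Python. To be clear, this is not the paper's actual pipeline: the sentence splitting and the word-overlap "support" test are naive stand-ins for the LLM-based claim decomposition and fact-checking the authors use, and the function names, entity, snippets, and numbers below are all my own invention.

```python
# Minimal sketch of a WildHallucinations-style factuality check.
# The claim splitting and support test are naive stand-ins; a real
# checker would use an entailment/fact-checking model instead.

from dataclasses import dataclass

@dataclass
class EvalResult:
    entity: str
    n_claims: int
    n_unsupported: int

    @property
    def hallucination_rate(self) -> float:
        return self.n_unsupported / self.n_claims if self.n_claims else 0.0

def extract_claims(response: str) -> list[str]:
    # Naive stand-in: treat each sentence as one atomic claim.
    return [s.strip() for s in response.split(".") if s.strip()]

def is_supported(claim: str, knowledge_base: list[str]) -> bool:
    # Naive stand-in: a claim counts as supported if some snippet
    # contains most of its words.
    words = set(claim.lower().split())
    return any(
        len(words & set(snippet.lower().split())) >= 0.7 * len(words)
        for snippet in knowledge_base
    )

def evaluate_entity(entity: str, response: str,
                    knowledge_base: list[str]) -> EvalResult:
    claims = extract_claims(response)
    unsupported = sum(1 for c in claims if not is_supported(c, knowledge_base))
    return EvalResult(entity, len(claims), unsupported)

# Toy usage: the second claim contradicts the snippet, so the
# hallucination rate comes out to 0.5.
kb = ["Mount Example is a 3,000 m peak first climbed in 1954."]
resp = "Mount Example is a 3,000 m peak. It was first climbed in 1890."
print(evaluate_entity("Mount Example", resp, kb).hallucination_rate)
```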

When put to the test, even advanced LLMs showed some concerning trends:

  • Models consistently made more mistakes on topics without Wikipedia coverage.
  • Performance varied widely across subject areas: models did better on some science topics but struggled with celebrity and finance queries.
  • Retrieval-augmented models like Perplexity.ai had fewer errors overall, but sometimes introduced new inaccuracies from outdated or irrelevant sources (a simple guard against this is sketched below).
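
That last failure mode is worth dwelling on: a retrieval-augmented model can faithfully "ground" its answer in a source that is simply stale or off-topic. One common mitigation is to gate retrieved snippets on relevance and recency before the model ever sees them. Here's a hedged sketch; the Snippet fields and the thresholds are hypothetical illustrations, not a description of Perplexity.ai's actual system:

```python
# Minimal sketch: a recency/relevance gate for retrieved snippets.
# Field names and thresholds are hypothetical, for illustration only.

from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Snippet:
    text: str
    published: date
    relevance: float  # retriever similarity score in [0, 1]

def filter_snippets(snippets: list[Snippet],
                    min_relevance: float = 0.5,
                    max_age_days: int = 365) -> list[Snippet]:
    # Drop snippets that are off-topic or stale before grounding the
    # model on them.
    today = date.today()
    return [
        s for s in snippets
        if s.relevance >= min_relevance
        and (today - s.published).days <= max_age_days
    ]

# Toy usage: only the recent snippet survives the gate.
fresh = Snippet("Q3 earnings beat estimates.",
                date.today() - timedelta(days=30), 0.9)
stale = Snippet("2019 annual report.",
                date.today() - timedelta(days=2000), 0.9)
print([s.text for s in filter_snippets([fresh, stale])])
```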

Looking at specific models:

  • GPT-4o: Strong in areas aligned with its training data, but prone to errors on newer or niche topics.
  • Gemini: Very cautious, often refusing to answer rather than risk an error. This approach limits mistakes but also reduces usefulness.
  • Perplexity.ai: Benefited from its retrieval system, but quality heavily depended on the sources it found.

The WildHallucinations benchmark highlights clear areas for improvement in LLMs:

  • More diverse training data is needed, especially for emerging topics.
  • Retrieval systems need refinement to consistently find relevant, up-to-date information.
  • Models must balance caution with usefulness. Being too conservative limits their real-world value; the sketch below frames this trade-off as a single threshold.
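
That caution-versus-usefulness trade-off can be pictured as a single knob. In the sketch below, `confidence` is a hypothetical score (it could come from token logprobs, a verifier model, or self-assessment; the paper doesn't prescribe one), and the threshold decides whether the model answers or abstains:

```python
# Minimal sketch: trading caution against usefulness with one threshold.
# The confidence score is an assumed input, not a real model API.

def answer_or_abstain(draft_answer: str, confidence: float,
                      threshold: float = 0.7) -> str:
    # A high threshold behaves like the cautious model above: few
    # factual errors, but many refusals. A low threshold answers more
    # questions at the cost of more hallucinations.
    if confidence >= threshold:
        return draft_answer
    return "I'm not confident enough to answer that reliably."

# Same draft, two policies: the cautious one refuses, the eager one answers.
print(answer_or_abstain("It launched in 2023.", confidence=0.6, threshold=0.9))
print(answer_or_abstain("It launched in 2023.", confidence=0.6, threshold=0.3))
```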

This benchmark is a valuable tool for pushing LLM development forward. It provides a realistic measure of how these models handle the kinds of questions people actually ask, exposing weaknesses that need addressing.

As LLMs continue to improve, tools like WildHallucinations will be crucial in making sure they become more reliable and trustworthy sources of information.