Minimalist image with a pure white background and large centered black sans serif text that reads 'GPT-4 Only'

16,800 Papers Are Still Using GPT-4 In 2025. That’s A Problem.

There are about 16,800 Google Scholar results from 2025 that mention GPT-4 while explicitly excluding newer GPT models such as GPT-4o, GPT-4.1, GPT-4.5, or GPT-5. The query is not loose. It is written to match GPT-4 and then filter out every obvious successor string. These are 2025 papers built around a model that is now roughly 2.5 years old.

That is the core issue. If your paper on AI in education or LLMs in clinical practice is framed as describing what current systems can do, but you only test GPT-4 and never mention successors, you are publishing historical data and presenting it as current evidence.

What the 16,800-paper number actually says

The search behind the screenshot is deliberately strict (a rough reconstruction of the query is sketched after this list). It:

  • Matches strings like GPT-4.
  • Explicitly excludes gpt-4o, gpt-4.1, gpt-4.5, and gpt-5*.
  • Applies a time filter for the year 2025.
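
For concreteness, here is roughly what such a query can look like. The exact string behind the screenshot is not published, and Scholar's handling of wildcards like gpt-5* is patchy, so treat this as a plausible reconstruction that spells the exclusions out, not the original search.

    from urllib.parse import urlencode

    # Plausible reconstruction of the Scholar query described above: match
    # "GPT-4" but drop results that also mention an obvious successor,
    # restricted to publications from 2025.
    query = '"GPT-4" -"GPT-4o" -"GPT-4.1" -"GPT-4.5" -"GPT-5" -"GPT-5.1"'

    params = {
        "q": query,
        "as_ylo": 2025,  # earliest publication year
        "as_yhi": 2025,  # latest publication year
    }

    print("https://scholar.google.com/scholar?" + urlencode(params))

Whatever exact phrasing you use, the count will drift as Scholar indexes new papers, but the order of magnitude is the point.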

That is why the ~16,800 figure is worrying. These are not 2023 pilot studies or 2024 preprints. They are 2025 publications, and in many cases they are written as if GPT-4 still stands in for “modern LLMs”.

The visible examples in the screenshot are from Springer, Wiley, and The Lancet Digital Health. Topics include GPT-4 in education, GPT-4 in clinical decision support, GPT-4 in patient communication, and so on. These are exactly the domains where people care about reliability, safety, and up-to-date performance.

Google Scholar snapshot: ~16,800 results in 2025 mentioning GPT-4 while excluding newer GPT models.

If you are doing a meta-analysis of GPT-4 performance, tracking model drift, or comparing GPT-4 to later versions, then focusing on GPT-4 makes sense. The problem is when these papers are written as if they answer questions like “how reliable are LLMs in clinical settings” and then people cite them as if they describe current systems that patients and students are interacting with.

Why are so many people still defaulting to GPT-4?

The most boring explanation is probably the right one. A lot of researchers are asking a chatbot what to use and then treating the answer as a binding recommendation. Something like:

  • “Which LLMs should I evaluate for my study on legal reasoning?”
  • “What is the best model to test for educational feedback?”

ChatGPT then suggests GPT-4, maybe alongside a couple of other well-known models, and that suggestion gets written straight into the study design.

Why does the assistant keep pushing GPT-4?

  • Most of its pretraining data treats GPT-4 as the flagship GPT model, so its internal picture of “serious model to study” is frozen around that era.
  • Safety and product tuning encourage cautious, mainstream recommendations, not aggressive nudging toward the latest experimental variant.
  • Its sense of what is “popular and well documented” lags reality for the same reason academic citations lag reality.

Then there is the usual academic drag. Grant proposals, ethics committees, and procurement workflows often specify “GPT-4” months before data collection starts. Swapping the model later can trigger new approvals, new risk assessments, or even new funding discussions. So everyone sticks with the original choice, even if GPT-4o, GPT-4.1, GPT-4.5, or GPT-5 are live and clearly better by the time the experiments actually run.

By the time that paper appears in a 2025 journal issue, it is already describing a model that the vendor treats as a legacy product.

GPT-4 vs newer models: the gap is large

GPT-4 was a real jump over GPT-3-level systems. It posted top-decile scores on professional exams such as the bar exam, did well on coding benchmarks, and raised the bar for LLM research. That is exactly why it became the baseline model for so many academic projects.

But the successors are more than small tune-ups. OpenAI alone has shipped GPT-4o, GPT-4.1, GPT-4.5, GPT-5, and then the GPT-5.1 family, and other labs have moved just as fast. Public benchmark data shows big drops in factual error rates, especially on scientific and technical tasks, along with much stronger reasoning and tool use. GPT-5-level models have been described as having up to 80 percent fewer factual mistakes than GPT-4 and GPT-4o on some internal evaluations.

If your paper is titled something like “Reliability of GPT-4 for clinical decision support”, you are really measuring how a retired model behaves. That is interesting as a historical record or a baseline, but it is not how the strongest models in production behave in mid 2025.

If you want a concrete breakdown of how the newer GPT-5.1 variants are positioned and which ones are worth testing, I covered that here: GPT-5.1 Family on OpenRouter: API Access, Pricing, and Which Model To Use.

And if you want a sense of how quickly model providers are iterating beyond any single release, the broader race across labs is part of the point. I wrote about that dynamic with Gemini 3 Pro versus GPT-5.1 and Claude Opus 4.5 here: The AI Model Rush: Why Gemini 3 Pro Will Lead the Pack Against GPT-5.1 and Claude Opus 4.5.

How LLM recommendations keep research stuck on old models

There is a feedback loop here that nobody is really accountable for:

  • Researchers ask ChatGPT which model to study.
  • The assistant, trained on old data and tuned to be cautious, suggests GPT-4 as the safe, well known choice.
  • Researchers design the protocol around GPT-4 and publish in 2025.
  • Those GPT-4-focused papers end up in the training data for future models, reinforcing GPT-4 as the canonical research target.

The result is a kind of stasis. The assistant keeps recommending old models because the literature is dominated by old models, and the literature stays dominated by old models because people keep following the assistant’s recommendations.

OpenAI could soften this quite a bit by making their assistants very explicit about recency when someone is clearly designing a new study. If you say “I am planning a new evaluation of LLMs in radiology for 2025”, the answer should strongly promote the current flagship variants rather than GPT-4 as a default.

How to choose the right LLM for research in 2025

If you are planning a study now, here is a simple checklist to avoid publishing something obsolete on day one.

  • Start with provider documentation, not a chatbot. Look at the official model list and release notes from OpenAI, Anthropic, Google, or any other vendor you plan to use. Check which models they describe as current general-purpose flagships.
  • Use the current flagship for new work. If GPT-4o, GPT-4.1, GPT-4.5, GPT-5, or GPT-5.1 are available and clearly stronger than GPT-4, they should be your default starting point. That does not mean you ignore GPT-4, but it becomes a baseline, not the main act.
  • Be precise about model strings and dates. Write down the full model name and, if possible, the dated snapshot, such as gpt-5.1-2025-07-01. The vague label “GPT-4” is no longer enough for serious work; the sketch after this checklist shows one way to record it.
  • Justify any use of older models. If you are forced onto GPT-4 because of institutional constraints, funding limits, or local access issues, say that clearly. That tells readers your results reflect local constraints, not the top of the field.
  • Stop outsourcing model selection to the thing you are studying. Use LLMs for brainstorming prompts, generating code, or summarising results, but not as the sole authority on which model you should test.
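
As a minimal sketch of what the “documentation first” and “precise model strings” points above look like in practice, here is one way to do it with the OpenAI Python SDK. The snapshot name gpt-5.1-2025-07-01 is the hypothetical example from the checklist, not a confirmed identifier; substitute whatever the provider's documentation lists as current.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Step 1: check what the provider actually serves today, rather than
    # asking a chatbot which model to study.
    for model in client.models.list():
        print(model.id)

    # Step 2: pin a full, dated model identifier in the protocol and the code.
    # "gpt-5.1-2025-07-01" is the hypothetical snapshot from the checklist.
    MODEL = "gpt-5.1-2025-07-01"

    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Summarise this discharge note: ..."}],
    )

    # Step 3: log both the requested string and the snapshot the API reports
    # back, so the methods section can cite an exact, reproducible version.
    print("requested:", MODEL)
    print("resolved: ", response.model)

Logging the resolved model string alongside every run is cheap, and it is exactly the detail reviewers and replicators will ask for two years later.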

Why this is more than a version-number argument

This is not just about being fussy over minor version strings. When 2025 research on LLMs in education, healthcare, or law is still anchored on GPT-4, decision makers are going to get a skewed picture of what current systems can and cannot do.

They will read that “GPT-4 hallucinated X percent of the time” and then implicitly project that figure onto GPT-5-level systems that already have much lower error rates on comparable tasks. That kind of mismatch can delay adoption of safer systems or, in the opposite direction, cause people to dismiss real risks because they assume later models are magically fixed.

Policy debates end up anchored to the weakest model that was recently studied, not the strongest model that is actually deployed. Practitioners who rely on academic work to choose tools might avoid newer, better options because the only evidence they see is about a model that has effectively been deprecated.

As long as we keep asking an LLM which LLM to evaluate and then follow that recommendation without question, Google Scholar is going to fill up with work that is out of sync with production systems. Model choice needs to be treated as a central methodological decision, not a footnote and definitely not something delegated to the subject of the research.

If you are writing or reviewing LLM papers in 2025, ask one blunt question: is this about GPT-4 as a historical snapshot, or about the models people are actually using now? If you cannot answer that clearly from the title and methods section, something has gone wrong.