
OpenRouter’s 100 Trillion Token Study: The Real State of AI Usage in 2025

OpenRouter just dropped a massive empirical study covering over 100 trillion tokens of anonymized metadata. This is not a theoretical white paper or a hype piece about AI potential. It’s a detailed look at what people are actually doing with LLMs across models, geographies, tasks, and time. The data runs from November 2024 through November 2025, and the findings challenge a lot of assumptions about how AI is being used in production.

The Open Source Surge Is Real and Durable

Proprietary models still handle the majority of tokens, but open-weight models have grown to roughly one-third of total usage by late 2025. That’s not a temporary spike from benchmark tourism. When DeepSeek V3, Kimi K2, or Qwen 3 Coder dropped, usage spiked and stayed elevated, indicating real production adoption rather than just curiosity. This confirms the ecosystem is structurally dual: proprietary for reliability and OSS for cost efficiency and customization.

Average Token Share by Model Origin

Average token share over the study period shows a strong proprietary lead, but OSS holds a significant 26.7% combined share.

Chinese OSS models specifically went from low single digits to nearly 30% of weekly tokens in some periods. DeepSeek alone accounts for 14.37 trillion tokens over the year, followed by Qwen at 5.59 trillion and Meta LLaMA at 3.96 trillion. The market is fragmenting, though. Late 2024 looked like a DeepSeek near-monopoly, with V3 and R1 together holding over 50% of OSS tokens. By late 2025, no single OSS model holds more than about 25%, with leadership shifting fluidly among Qwen, MiniMax M2, Moonshot’s Kimi K2, and the GPT-OSS variants.

Medium Models Are the New Sweet Spot

The study finds that small models under 15B parameters are increasingly irrelevant despite being plentiful; their usage share is declining. Medium models between 15B and 70B barely existed as a category before Qwen2.5 Coder 32B, but now represent a clear capability-to-cost sweet spot. This is the ‘model-market fit’ zone for balanced workloads. Large models over 70B still see strong usage, but users compare several options side by side rather than converging on one winner. This suggests usage is bifurcating between efficient medium models and maximal-quality large models.
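To make the tiers concrete, here is a minimal sketch of the study’s three size buckets. The thresholds come from the text; the helper function and the example parameter counts are illustrative assumptions:

```python
# Hypothetical helper illustrating the study's size tiers (<15B, 15-70B, >70B).
def size_bucket(params_billions: float) -> str:
    """Classify a model by parameter count into the study's three tiers."""
    if params_billions < 15:
        return "small"
    if params_billions <= 70:
        return "medium"
    return "large"

# Publicly stated (approximate) total parameter counts, for illustration only.
for name, size in [("Qwen2.5 Coder 32B", 32), ("Llama 3.1 70B", 70), ("DeepSeek V3", 671)]:
    print(f"{name}: {size_bucket(size)}")
```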

Roleplay and Programming Dominate Real-World Tasks

This is where the data challenges the common narrative that LLMs are purely productivity tools. Looking only at open-weight models, two categories account for the vast majority of usage: Roleplay at roughly 52%, with Programming the second-largest category.

Roleplay includes games, persona chat, co-writing, and adult content. Open models dominate here because closed models typically refuse these requests. Programming usage is growing steadily, with developers choosing OSS for cost efficiency and deployment flexibility.

OSS Usage by Category

Roleplay dominates open source model usage, followed by programming assistance.

For Chinese OSS models specifically, the mix tilts more professional: Roleplay is still the largest single category at around 33%, but coding and technology combined account for about 39% of usage. This demonstrates where Chinese models like Qwen and DeepSeek have been focusing their efforts.

Across all models (closed + open), Programming emerged as the single dominant and growing category, rising from 11% to over 50% of total tokens by late 2025. This reflects the mainstream adoption of AI-assisted development tools.

Agentic Inference Is Taking Over

The release of OpenAI’s o1 in late 2024 normalized multi-step, reasoning-style inference. The shift is clearly visible in the usage data: reasoning models accounted for over 50% of all token usage by late 2025, up from a negligible share at the start of the year. Tool-calling adoption shows a similar upward trend, becoming central to high-value workflows.

This complexity is reflected in sequence lengths. Average prompt length has roughly quadrupled since early 2024, from about 1,500 tokens to over 6,000. Completions nearly tripled, from around 150 to 400 tokens. Programming drives most of this growth, often requiring inputs exceeding 20,000 tokens: users are passing entire codebases and documents into models rather than simple queries. The implication is clear: the typical LLM request is now part of a structured, multi-step workflow, meaning agentic inference is becoming the default.
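To make “agentic inference” concrete, here is a minimal sketch of a multi-step tool-calling loop against OpenRouter’s OpenAI-compatible endpoint. The model ID and the single read_file tool are illustrative assumptions, and error handling is omitted:

```python
# Minimal agentic loop sketch against OpenRouter's OpenAI-compatible API.
# The model ID and the read_file tool are illustrative, not from the study.
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool the agent can call
        "description": "Return the contents of a file in the repo.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "Summarize what main.py does."}]
while True:
    resp = client.chat.completions.create(
        model="deepseek/deepseek-chat", messages=messages, tools=tools
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:       # model produced a final answer; stop looping
        print(msg.content)
        break
    messages.append(msg)         # keep the assistant turn that requested tools
    for call in msg.tool_calls:  # execute each tool and feed the result back
        args = json.loads(call.function.arguments)
        result = open(args["path"]).read()  # stand-in for a real sandboxed tool
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

The loop is the defining feature: each iteration appends context, which is exactly why average request lengths keep climbing.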

The Glass Slipper Effect: Retention is the New North Star

One of the more interesting findings is what the authors call the Glass Slipper effect. Most model cohorts show high churn and rapid decay in retention. But some foundational cohorts show unusually strong retention, indicating they found a lasting workload-model fit. For example, specific cohorts of Claude 4 Sonnet and Gemini 2.5 Pro retained roughly 40% of users at Month 5, far above the average.

The hypothesis is that when a model finally solves a previously unsolved workload at the right technical and economic constraints, users embed it deeply into their pipelines. Switching costs then create strong lock-in. Models that don’t find this fit remain merely ‘good enough’ alternatives. This is why retention curves can serve as a fingerprint of true capability breakthroughs.
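If you want to look for a Glass Slipper pattern in your own logs, a basic cohort-retention calculation looks something like the sketch below. The column names and toy data are assumptions; the study’s exact methodology is not published in this form:

```python
# Minimal cohort-retention sketch in pandas. Column names and the toy data
# are assumptions for illustration, not the study's actual pipeline.
import pandas as pd

# One row per (user, model, calendar month) with any activity in that month.
usage = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "c"],
    "model":   ["m1"] * 6,
    "month":   pd.PeriodIndex(
        ["2025-01", "2025-02", "2025-06", "2025-01", "2025-02", "2025-01"],
        freq="M"),
})

# Cohort = the month a user first touched the model; age = months since then.
usage["cohort"] = usage.groupby(["user_id", "model"])["month"].transform("min")
usage["age"] = (usage["month"] - usage["cohort"]).apply(lambda d: d.n)

# Rows: (model, cohort); columns: age in months; values: distinct active users.
active = (usage.groupby(["model", "cohort", "age"])["user_id"]
               .nunique().unstack("age", fill_value=0))
retention = active.div(active[0], axis=0)  # normalize by month-0 cohort size
print(retention)  # a Glass Slipper cohort would plateau instead of decaying
```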

Price Doesn’t Explain Usage

The study shows weak correlation between a model’s effective cost and its total usage volume. The market segments clearly:

AI Model Market Segmentation by Cost vs Usage

The market is segmented into distinct archetypes based on cost and usage volume.

  • Efficient giants: Gemini 2.0 Flash at $0.15 per million tokens and DeepSeek V3 at $0.39 per million see massive adoption by delivering strong quality at low prices. They drive high-volume, cost-sensitive workloads.
  • Premium leaders: Claude 3.7 Sonnet at roughly $1.96 per million maintains high usage despite premium pricing because users value its quality and reliability for high-stakes tasks.
  • Premium specialists: GPT-4 at over $34 per million sees lower usage, reserved for critical, highest-stakes tasks.

The market is not commoditized. Differentiation—whether quality, latency, context, or tooling—still drives pricing power. Enterprises pay for reliability. Hobbyists and pipelines are cost-sensitive. Both segments are durable.
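As a quick illustration of why price fails as a predictor, you can rank-correlate price against volume. The sketch below uses the prices quoted above with made-up volume figures, so the output is illustrative only:

```python
# Rank correlation between price and usage volume. Prices are the article's
# quoted figures; the volume numbers are hypothetical placeholders.
from scipy.stats import spearmanr

price_per_mtok = [0.15, 0.39, 1.96, 34.0]  # Gemini 2.0 Flash, DeepSeek V3,
volume_ttok    = [8.0, 4.0, 9.0, 0.3]      # Claude 3.7 Sonnet, GPT-4 (made up)

rho, p = spearmanr(price_per_mtok, volume_ttok)
print(f"Spearman rho={rho:.2f} (p={p:.2f})")  # no clean monotonic price-usage link
```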

Geographic Distribution and Language

Usage is global. North America accounts for roughly 47% of spend, but Asia has grown to about 29% and is still rising, while Europe holds steady in the low twenties. This highlights the growing importance of non-Western players, both as consumers and producers of models. English dominates language usage at over 80%, with Simplified Chinese at nearly 5% and Russian at about 2.5%. This underscores the need for cross-lingual quality and regional compliance for global adoption.

Implications for the AI Ecosystem

The data clearly shows that different providers have distinct category profiles aligned with their strategic goals. Anthropic’s Claude is heavily concentrated in programming and technology, positioning itself as the reasoning/coding model. Google’s Gemini shows a broader mix across translation, science, technology, and general knowledge, aiming for wide utility. xAI started heavily programming-focused but has broadened as its user base expanded.

The LLM market is structurally plural. No single model dominates across tasks. Closed and open models both have durable positions. Multi-model stacks are standard practice. Developers need model-agnostic platforms to route traffic efficiently based on cost, quality, and specific task needs. For model builders, the competitive frontier is shifting from single-pass accuracy to robust multi-step reasoning, tool handling, and long-context resilience.
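One concrete pattern is a client-side routing table keyed on task profile, as in the sketch below. The routes and model IDs are illustrative assumptions, not recommendations from the study:

```python
# Minimal client-side model routing sketch over OpenRouter's API. The routing
# table and model IDs are illustrative assumptions, not study recommendations.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

# Route by task profile: cheap high-volume work vs. quality-sensitive work.
ROUTES = {
    "bulk_summarize": "google/gemini-2.0-flash-001",  # cost-sensitive pipeline
    "code_review":    "anthropic/claude-3.7-sonnet",  # high-stakes task
}

def complete(task: str, prompt: str) -> str:
    """Pick a model for the task, falling back to an efficient default."""
    model = ROUTES.get(task, "deepseek/deepseek-chat")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

print(complete("code_review", "Review this diff: ..."))
```

In practice the routing signal can also include latency budgets and context-length requirements, not just cost and quality.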

One important caveat: usage data is heavily skewed by release date. A newer, stronger model can show far lower cumulative usage than an older one simply because it has had less time to accumulate tokens. Claude Opus 4.5 is a much better coder than any Sonnet model, but Claude Sonnet 4 has had far more cumulative usage. Raw usage numbers don’t indicate quality; retention curves and specific workload fit are better indicators of true capability breakthroughs.

The study provides a useful empirical baseline for understanding how LLMs are actually used rather than how people assume they’re used. The gap between those two things is larger than most people realize. The focus needs to shift from benchmarks to operational excellence in agentic systems and task success.