AI Costs in 2025: Cheaper Tokens, Pricier Workflows – Why Your Bill is Still Rising

The AI economy in 2025 is a study in contrasts: raw token costs continue to plummet, yet overall AI bills for many businesses are heading in the opposite direction. Two forces still define AI economics: tokens are cheaper; workflows are hungrier. Since late 2024, agent chains involving planning, tool use, retrieval, and memory have multiplied token consumption per task. Even as prices fall further with the arrival of GPT-5 and Gemini 2.5, per-task cost now depends more on the design of these agent chains and the ‘thinking’ budgets allocated to them than on the list price of individual tokens.

This paradox is rooted in the rise of agentic workflows. Models like o3, DeepSeek R1, Grok 4, and Kimi K2 popularized these multi-step processes, and token consumption per task has jumped 10x-100x since December 2023. Initially, venture capital subsidies allowed platforms to offer “unlimited usage,” effectively masking these rising costs. Now, as those subsidies expire, users face caps and price hikes that reveal the true expense of these token-hungry processes.

The Shifting Sands of AI Pricing: What’s New Since Late 2024

The latter half of 2024 and early 2025 saw significant shifts in the pricing and capabilities of frontier models, resetting the competitive landscape:

  • OpenAI’s GPT-5 Lineup: The arrival of GPT-5 brought three API sizes and explicit reasoning control. The main GPT-5 model is priced at $1.25/M input tokens and $10/M output tokens. More affordably, GPT-5 Mini is $0.25/$2, and GPT-5 Nano is $0.05/$0.40. One critical detail: reasoning tokens are billed as output. This means that adjusting ‘thinking’ settings directly impacts your output bill. While Mini and Nano are significantly cheaper per token, they can become extremely reasoning-hungry and still rack up costs at higher thinking settings. For a deeper look at GPT-5, consider my prior guides like GPT-5: A Practical Upgrade and How to Pick the Right GPT-5 Model as a Developer.
  • Google’s Gemini 2.5 Series: Gemini 2.5 redefined its pricing tiers. Flash-Lite is $0.10/M input and $0.40/M output, Flash is $0.30/$2.50, and Pro is $1.25/$10. A notable advantage of the Gemini line is implicit caching and batch mode, which can significantly reduce costs on repeated context and offline workloads.
  • Anthropic’s Claude 4 Models: Opus 4.1 remains a premium offering at $15/$75 per million tokens, while Sonnet 4 is $3/$15. Haiku 3.5 is more accessible at $0.80/$4. Prompt caching can help mitigate input costs here. While still pricey, Claude models excel in specific, complex agentic tasks. My experience with Claude Opus 4.1 highlights its capabilities, especially for complex coding and tool use.
  • xAI Grok 4: Positioned as a reasoning-only API, Grok 4 rings in at $3/$15 per million tokens with a 256k context window. On paper it looks similar to Sonnet, but its heavy use of thinking tokens means it can effectively cost roughly 3x Sonnet with thinking enabled, and nearly 15x Sonnet running without thinking. This makes it deceptively expensive for general API tasks.
  • Moonshot’s Kimi K2: Released with open weights (1T params, ~32B active MoE), Kimi K2’s API pricing is around $0.15/M input (for cached hits) and $2.50/M output. It’s a strong contender for agentic coding and tool use, and its open weights allow for private deployments, which can offer greater cost control and privacy.
  • Alibaba’s Qwen3 Coder: Another open-weight model (480B params, ~35B active MoE), Qwen3 Coder is priced around $0.20/M input (for cached hits) and $0.80/M output. It’s also strong for agentic coding and tool use. Cerebras serves it at over 2k tps for $2/$2, offering blazing-fast inference.
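The list prices above are easy to turn into a per-task estimate once you remember that reasoning tokens bill at the output rate. A minimal sketch (prices from the bullets above; the token counts in the example calls are made-up assumptions for illustration):

```python
# Per-task cost estimator. Reasoning tokens are billed as output tokens,
# so they are folded into the output count here.
PRICES = {  # $ per million tokens (input, output), from the list above
    "gpt-5": (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5-nano": (0.05, 0.40),
    "gemini-2.5-pro": (1.25, 10.00),
    "claude-sonnet-4": (3.00, 15.00),
    "grok-4": (3.00, 15.00),
}

def task_cost(model, input_tokens, output_tokens, reasoning_tokens=0):
    """Dollar cost of one task; reasoning tokens billed at the output rate."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price
            + (output_tokens + reasoning_tokens) * out_price) / 1_000_000

# Same hypothetical task; Grok 4 burns far more reasoning tokens.
print(f"{task_cost('gpt-5', 10_000, 1_000, 4_000):.4f}")    # 0.0625
print(f"{task_cost('grok-4', 10_000, 1_000, 20_000):.4f}")  # 0.3450
```

The point of the sketch: with identical input and answer lengths, the reasoning-token column dominates the bill, which is why a reasoning-hungry model can cost several times its list-price peer.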

The True Cost: Beyond Token Price Tags

To truly understand AI expenditures, focusing solely on token list prices is a mistake. The ‘cost to run’ a task, which accounts for the token inflation caused by agentic workflows, is a far more accurate metric. Here’s an updated pricing snapshot that includes this crucial dimension:

| Model | Intelligence Index | Coding Index | Input $/M tokens | Output $/M tokens | Cost to Run (task-adjusted) $ |
| --- | --- | --- | --- | --- | --- |
| GPT-5 (high) | 69 | 55 | 1.25 | 10.00 | 823 |
| GPT-5 (medium) | 68 | 55 | 1.25 | 10.00 | 650 |
| Grok-4 | 68 | 64 | 3.00 | 15.00 | 1658 |
| o3 | 67 | 60 | 2.00 | 8.00 | 432 |
| o4-mini (high) | 65 | 63 | 1.70 | 4.40 | 410 |
| Gemini 2.5 Pro | 64 | 61 | 1.25 | 10.00 | 983 |
| GPT-5 Mini | 64 | 47 | 0.25 | 2.00 | 67 |
| Qwen3 235B 2507 (Reasoning) | 63 | 61 | 0.70 | 8.40 | 892 |
| GPT-5 (low) | 59 | 57 | 1.25 | 10.00 | 220 |
| Claude 4 Sonnet Thinking | 59 | 53 | 3.00 | 15.00 | 793 |
| DeepSeek R1 0528 | 58 | 59 | 0.57 | 1.90 | 229 |
| Gemini 2.5 Flash (Reasoning) | 58 | 55 | 0.30 | 5.00 | 330 |
| GLM-4.5 | 58 | 54 | 0.25 | 2.00 | 225 |

The ‘Cost to Run’ metric gives a more accurate picture by accounting for the increased token consumption in agentic workflows.
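One way to read the table is dollars of ‘Cost to Run’ per intelligence-index point, a rough value-for-money signal. A small sketch using a subset of the rows above (figures copied from the table; the ratio itself is my own illustrative metric, not a published benchmark):

```python
# 'Cost to run' per intelligence-index point, from the table above.
ROWS = {  # model: (intelligence_index, cost_to_run_usd)
    "GPT-5 (high)": (69, 823),
    "GPT-5 (medium)": (68, 650),
    "Grok-4": (68, 1658),
    "GPT-5 Mini": (64, 67),
    "DeepSeek R1 0528": (58, 229),
    "GLM-4.5": (58, 225),
}

value = {m: round(cost / idx, 1) for m, (idx, cost) in ROWS.items()}
best = min(value, key=value.get)
print(best, value[best])  # GPT-5 Mini 1.0
```

By this crude measure GPT-5 Mini delivers an index point for about a dollar per task-adjusted run, while Grok-4 pays over twenty times that for a similar score.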

Mastering Reasoning Budgeting: The Hidden Cost Knob

One of the easiest ways to blow your AI budget is by mishandling reasoning. Treat ‘thinking’ or reasoning settings like a spend knob. Reasoning tokens are billed as output tokens, directly inflating your output costs. My recommendation:

  • Default to low thinking: For routine or less critical steps within an agentic workflow, keep the thinking budget minimal.
  • Increase only when necessary: Reserve higher reasoning settings for hard constraints, safety protocols, and complex planning stages where accuracy and reliability are paramount.
  • Consider cheaper models with high reasoning: This is a powerful cost-saving strategy. For example, using GPT-5 Nano with its highest reasoning setting will consume significantly more tokens than full GPT-5 with minimal reasoning. However, because GPT-5 Nano is so much cheaper per token, the overall cost could still be substantially lower, and it might be sufficient for your task. Similarly, GPT-5 Mini with high reasoning could outperform and cost less than a full GPT-5 model with minimal reasoning for specific applications. For more on GPT-5 Nano, see GPT-5 Nano on Cline.

Expect higher output token counts whenever reasoning is enabled. This isn’t a bug; it’s the model articulating its thought process, and that process comes with a price. Understanding this is key to optimizing your AI spend.
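The recommendations above can be sketched as a simple effort policy plus a bill estimator. The effort-to-multiplier mapping, the step names, and all token counts below are illustrative assumptions, not published figures:

```python
# Sketch: treat reasoning effort as a spend knob. The multipliers are an
# illustrative assumption: higher effort roughly multiplies the reasoning
# tokens generated per step, and those bill at the output rate.
REASONING_MULTIPLIER = {"minimal": 0.2, "low": 1.0, "medium": 4.0, "high": 12.0}

def pick_effort(step_kind):
    """Default to low; escalate only where accuracy is paramount."""
    critical = {"safety_check", "final_plan", "hard_constraint"}
    return "high" if step_kind in critical else "low"

def estimated_output_bill(steps, base_reasoning_tokens, answer_tokens,
                          out_price_per_m):
    """Total output-side cost of a chain; reasoning dominates at high effort."""
    total = 0
    for kind in steps:
        mult = REASONING_MULTIPLIER[pick_effort(kind)]
        total += base_reasoning_tokens * mult + answer_tokens
    return total * out_price_per_m / 1_000_000

steps = ["draft", "tool_call", "safety_check", "summarize"]
# GPT-5 Mini output rate ($2/M): three low-effort steps, one high-effort one.
print(round(estimated_output_bill(steps, 500, 200, 2.00), 6))  # 0.0166
```

Even in this toy chain, the single high-effort step accounts for roughly three quarters of the output bill, which is exactly why effort should be raised per step rather than globally.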

Why Agent Chains Fuel Token Inflation

Despite dropping token prices, agent models make tasks expensive due to inherent token inflation within their multi-step designs:

  • Verbose components: Lengthy tool manifests, overly descriptive system primers, and large, repeated retrieval payloads (e.g., pulling entire documents for every step) inflate input token counts.
  • Multi-agent handoffs: When multiple agents collaborate, the context passed between them can be redundant and token-heavy. Each handoff often requires resending substantial parts of the conversation history or necessary data.
  • Memory writes: Persistent memory systems, while crucial for long-running agents, incur costs with every write and retrieval, adding to the overall token tally.

Beyond raw tokens, platforms often add their own fees. Orchestration node fees and guardrails can add a measurable cost per step. For instance, some models might spam your search tool repeatedly, racking up external API costs on top of token usage. There’s also the dangerous ‘YOLO Mode,’ where unbounded retries and recursive planning can quietly explode both tokens and platform steps, leading to unexpected surges in billing.
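Guarding against that ‘YOLO Mode’ is mostly a matter of hard caps. A minimal sketch of a budget-bounded agent loop, with all names and limits my own illustrative choices:

```python
# Guardrail sketch against 'YOLO mode': cap steps, retries, and token spend
# so an agent loop cannot quietly explode its bill.
class BudgetExceeded(Exception):
    pass

def run_agent(step_fn, max_steps=20, max_retries=2, token_budget=50_000):
    """step_fn(step) returns (result, tokens_used); None result means retry."""
    spent = 0
    for step in range(max_steps):
        for attempt in range(max_retries + 1):
            result, tokens = step_fn(step)
            spent += tokens
            if spent > token_budget:
                raise BudgetExceeded(f"{spent} tokens after step {step}")
            if result is not None:  # step succeeded
                break
        else:
            raise RuntimeError(f"step {step} failed after {max_retries + 1} tries")
        if result == "done":
            return spent
    raise RuntimeError("hit max_steps without finishing")

# Toy step function: finishes on step 3, spends 1,000 tokens per call.
spent = run_agent(lambda s: ("done" if s == 3 else "ok", 1_000))
print(spent)  # 4000
```

The same wrapper turns unbounded recursion into a loud, billable-once failure: blow the budget and you get an exception instead of a surprise invoice.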

Decoding Agent Chain Costs

Agentic workflows, while powerful, amplify token consumption through their structured, multi-step operations. Each ‘thought’ and ‘action’ translates to billable tokens.

  • Planning: The model ‘thinks’ and breaks down tasks, generating intermediate steps that often require more output tokens.
  • Tool Use: Calls external APIs, adding input tokens for tool manifests and output tokens for results parsing.
  • Self-Reflection: Critiques its own output, generating additional tokens for internal dialogue and re-planning.
  • Memory/Retrieval: Accesses or writes to long-term memory, adding input/output tokens for context recall and updates.

Each step, while necessary for complex tasks, contributes to higher overall token usage and cost.
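Those step types are easiest to reason about with a per-step token ledger, so you can see where a chain's spend actually goes. A sketch; the step names mirror the text, and all token counts are illustrative:

```python
# Token ledger sketch: each agent step logs input and output tokens.
from collections import defaultdict

class TokenLedger:
    def __init__(self):
        self.by_step = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, step, input_tokens, output_tokens):
        self.by_step[step]["input"] += input_tokens
        self.by_step[step]["output"] += output_tokens

    def cost(self, in_price_per_m, out_price_per_m):
        """Total dollar cost across all recorded steps."""
        return sum(v["input"] * in_price_per_m + v["output"] * out_price_per_m
                   for v in self.by_step.values()) / 1_000_000

ledger = TokenLedger()
ledger.record("planning", 2_000, 1_500)       # intermediate steps: output-heavy
ledger.record("tool_use", 3_000, 400)         # tool manifests: input-heavy
ledger.record("self_reflection", 1_000, 800)
ledger.record("memory_retrieval", 2_500, 300)
# GPT-5 Mini rates ($0.25/$2 per million) from the pricing section.
print(round(ledger.cost(0.25, 2.00), 6))  # 0.008125
```

Instrumenting the chain this way is what makes the later optimization decisions (which step to trim, which model to swap in) data-driven rather than guesswork.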

Recommended Tiering Strategy: Simple and Defensible

Given the complexities, a tiered approach to model selection is the most cost-effective and defensible strategy for managing AI expenses:

  • Default for chain-heavy workloads: Near-frontier models. For most agentic workflows that require multiple steps, go with models that offer massive cost savings for a slight performance trade-off. Excellent candidates include Qwen3 Coder or GLM 4.5. GPT-5 Mini is also making a strong case here, though community testing is still validating its performance for general agentic tasks. Keep thinking budgets low to medium for these. For more on GLM 4.5, check out GLM-4.5: Solid Writing Model That Matches the Competition.
  • Reserve expensive models: Use Claude Sonnet 4 or Opus 4.1 only when the cheaper models consistently fail to meet the required performance or handle specific, very complex constraints. These are powerful, but their cost means they should be a last resort.
  • Grok 4: Not for API tasks. I can’t recommend Grok 4 for any general API task. If you’re already paying for a subscription to it, then use it within its native interfaces. Its cost profile, particularly with its reasoning token consumption, makes it unsuitable for typical API integrations where cost per task is a major concern.

This strategy ensures you’re not overspending on premium models for tasks that can be handled by more cost-effective alternatives, while still having access to top-tier intelligence when absolutely necessary.
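The tiering strategy reduces to a small routing loop: try the near-frontier model first and escalate only on a failed quality check. A sketch; the model names mirror the tiers above, and `call_model` / `passes_check` are hypothetical stand-ins for your API client and evaluator:

```python
# Tiered routing sketch: cheapest capable model first, escalate on failure.
TIERS = ["qwen3-coder", "claude-sonnet-4", "claude-opus-4.1"]

def route(task, call_model, passes_check):
    for model in TIERS:
        answer = call_model(model, task)
        if passes_check(answer):
            return model, answer
    return TIERS[-1], answer  # fall back to the top tier's best effort

# Toy stand-ins: the cheap model handles short tasks, Sonnet handles the rest.
def fake_call(model, task):
    if model == "qwen3-coder" and len(task) > 20:
        return None  # cheap model "fails" on the long task
    return f"{model}:ok"

model, _ = route("short task", fake_call, lambda a: a is not None)
print(model)  # qwen3-coder
model, _ = route("a much longer, harder task spec", fake_call, lambda a: a is not None)
print(model)  # claude-sonnet-4
```

The escalation order is the whole policy: premium models only ever see the tasks the cheaper tiers demonstrably could not handle.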

The Future of AI Economics: Optimization (Not Just Raw Price)

The trend is clear: raw token prices will continue to drop, but the complexity of AI applications will keep per-task costs significant. The focus for businesses and developers shifts from simply looking for the cheapest tokens to mastering the art of cost optimization within agentic workflows. This means:

  • Intelligent Agent Design: Building agents that are lean and efficient, minimizing redundant steps, prompt verbosity, and unnecessary retrieval actions.
  • Dynamic Model Switching: Implementing logic that automatically switches to the most cost-effective model for each sub-task within a chain. A low-cost model for initial data extraction, a mid-tier for simple reasoning, and a frontier model only for critical decision points.
  • Advanced Caching and Memory Management: Using robust caching layers and memory systems that don’t just store but intelligently retrieve and summarize context, reducing repeated input tokens.
  • Careful Reasoning Budgeting: As discussed, treating thinking as a controllable cost variable.

The narrative around AI costs has moved beyond simple arithmetic. It’s now about orchestration, efficiency, and making informed choices about where to allocate your ‘thinking’ budget. Those who master this balance will be the ones who truly harness the power of AI without breaking the bank.

Ultimately, the current landscape demands a more nuanced approach to AI procurement and deployment. Businesses need to move beyond raw token costs and consider the full lifecycle cost of their agentic applications. This requires constant monitoring, iterative optimization, and a willingness to experiment with different models and architectures to find the sweet spot between performance and expenditure. The ROI of AI is still immense, but only for those who manage its true cost intelligently.


Adam Holter

Founder of Ironwood AI. Writing about AI stuff!