The 20% Toolkit: Specialized LLMs Developers Actually Need in 2025

The short version: use the main five for 80% of your work — o3, Claude Sonnet 4, Gemini 2.5 Pro, GLM 4.5, Qwen3 Coder. The remaining 20% goes to task-specific models where price, reasoning depth, modalities, or latency matter more than raw IQ. Below is a practical field guide with when-to-use rules, tradeoffs, and cost notes. I’ve also added quick visuals and a cost/fit matrix.

The Five You’ll Use 80% of the Time

OpenAI o3 — Warrior

  • Use when: You need steady reasoning, tool use, and reliable research. It’s my daily driver for analysis and agent planning.
  • Strengths: Smart, consistent, tool-friendly, solid for quick research. 200k context.
  • Weaknesses: Mid at long-form writing and frontend code polish.
  • Modalities: Text, vision
  • Pricing: $2 in / $8 out per million tokens

Claude Sonnet 4 — Artisan

  • Use when: You want high-quality coding, strong writing and design sense, or clean tool/agent flows with fewer retries.
  • Strengths: Great code and prose, design-aware, strong for agents.
  • Weaknesses: Weaker on heavy reasoning. Pricey.
  • Modalities: Text, vision
  • Pricing: $3 in / $15 out

Gemini 2.5 Pro — Sorcerer

  • Use when: You need multimodal with text/vision/audio and very long context, or creative ideation and UI/brand explorations.
  • Strengths: Smart, creative, long context up to 1M, good code and design.
  • Weaknesses: Reliability and steerability can slip; weak on tools.
  • Modalities: Text, vision, audio
  • Pricing: $1.25 in / $10 out

GLM 4.5 (Z.ai) — Apprentice

  • Use when: You want an open model that’s cheap and competent at code, writing, design, and basic agent tasks.
  • Strengths: Good code/writing/design for the price; open; cheap.
  • Weaknesses: Not a deep reasoner, no vision.
  • Modalities: Text
  • Pricing: $0.50 in / $2 out

Qwen3 Coder — Apprentice

  • Use when: You want speed-first coding with strong tool calling and rapid frontend iteration, especially on Cerebras.
  • Strengths: Fast coding (roughly 2,000 tokens per second on Cerebras), strong tool calls, good frontend patterns.
  • Weaknesses: Weak creativity/reasoning; no vision.
  • Modalities: Text
  • Pricing: $2 in / $2 out

If you only remember one rule: start with o3 or Sonnet 4 for most dev tasks. Switch to Gemini 2.5 Pro for huge multimodal context, GLM 4.5 for cheap open-source projects, and Qwen3 Coder when you want speed-first coding with tools.

[Figure: Fit Matrix, Cost vs. Capability. One axis runs from $ to $$$, the other is capability. Models plotted: GLM 4.5, Qwen3 Coder, o3, Sonnet 4, Gemini 2.5 Pro.]

Left to right: cost. Top to bottom: capability for general dev tasks. Use the main five by default; reach for specialists only when the constraints demand it.

The 20%: Specialized Models and When to Use Each

OpenAI

GPT-4.1 — Jack of All Trades

  • Use when: You need reliable tool calling, structured output, long context, and you want a steady generalist that follows instructions cleanly.
  • Strengths: Reliable tools, schema adherence, long context.
  • Weaknesses: Not best in any single category.
  • Modalities: Text, vision | Pricing: $2 in / $8 out | Context: 1M

GPT-4.1 Mini — Squire

  • Use when: You want cheap agents at scale, like customer support chat or lightweight RAG, where correct output format matters more than depth.
  • Strengths: Good tools, long context, cheap.
  • Weaknesses: Lower intelligence ceiling.
  • Modalities: Text, vision | Pricing: $0.40 / $1.60 | Context: 1M

GPT-4.1 Nano — Scribe

  • Use when: You need a data extraction mule with strict JSON and high throughput.
  • Strengths: Very cheap, reliable structured output.
  • Weaknesses: Minimal reasoning.
  • Modalities: Text, vision | Pricing: $0.10 / $0.40 | Context: 1M
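
For the strict-JSON extraction case, here is a minimal sketch of how I might check a model's output before trusting it. It makes no API calls and uses only Python's standard library. The schema (`invoice_id`, `total`, `currency`) and the function name `parse_strict` are hypothetical, chosen purely for illustration.

```python
import json

# Hypothetical schema for illustration: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "invoice_id": str,
    "total": (int, float),  # JSON numbers may parse as int or float
    "currency": str,
}


def parse_strict(raw: str) -> dict:
    """Parse model output and reject anything that deviates from the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = set(REQUIRED_FIELDS) - set(data)
    extra = set(data) - set(REQUIRED_FIELDS)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")

    for key, expected in REQUIRED_FIELDS.items():
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} has the wrong type")
    return data


record = parse_strict('{"invoice_id": "A-17", "total": 42.5, "currency": "EUR"}')
print(record["total"])
```

Rejecting extra keys as well as missing ones is a design choice: in a high-throughput pipeline it is usually cheaper to retry a bad record than to let a malformed one through.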

GPT-4o — Bard

  • Use when: You’re building voice agents with tools, need native image generation, or want a chatty assistant that handles memory nicely.
  • Strengths: Voice mode, image generation, personalization.
  • Weaknesses: Not very smart for complex reasoning.
  • Modalities: Text, vision, speech, image generation | Pricing: $2.50 / $10 | Context: 128k

o4-mini — Tactician

  • Use when: You want affordable, reliable reasoning for medium-complex agents that plan and reflect without blowing your budget.
  • Strengths: Solid reasoning, fast and cheap compared to frontier models.
  • Weaknesses: World knowledge can lag; weak writing/creativity.
  • Modalities: Text, vision | Pricing: $1.10 / $4.40 | Context: 200k

o3-pro — Colossus

  • Use when: You need the nuclear option for deep analysis, complex decisions, or jumbo context digestion. For 99.9% of tasks, it’s overkill.
  • Strengths: Tons of thinking, context devourer.
  • Weaknesses: Slow and expensive; bad for writing/design.
  • Modalities: Text, vision | Pricing: $20 / $80 | Context: 200k

Anthropic

Claude Opus 4 — Goldsmith

  • Use when: You want Sonnet’s strengths with more innate intelligence for code, writing, and design, and you can afford it.
  • Strengths: Great code, writing, design, tool use.
  • Weaknesses: Very expensive; still not a math-first tank.
  • Modalities: Text, vision | Pricing: $15 / $75 | Context: 200k

Claude 3.5 Haiku — Page

  • Use when: You want Claude vibes at a lower price for light coding and summarization, but GLM 4.5 often beats it on value.
  • Strengths: Cheap-ish, lightweight Claude flavor.
  • Weaknesses: Not very smart; value often loses to GLM 4.5.
  • Modalities: Text, vision | Pricing: $0.80 / $4 | Context: 200k

Google

Gemini 2.5 Flash — Archivist

  • Use when: You need to process tons of audio or video cheaply with long context, where you care about throughput more than frontier intelligence.
  • Strengths: Long context, cheap for media processing at scale.
  • Weaknesses: Can get confused under heavy data; not as smart as top models.
  • Modalities: Text, vision, audio | Pricing: $0.30 / $2.50 | Context: 1M

xAI

Grok 4 — Grandmaster

  • Use when: You’re running long agents that need math and code coherence with high depth. Prompt differently; it thinks a lot.
  • Strengths: Very smart, coherent over long runs, math/code strength.
  • Weaknesses: Slow, expensive in practice, overthinks.
  • Modalities: Text | Pricing: $3 / $15 | Context: 256k

Grok 3 Mini — Marshal

  • Use when: You want strong price/perf for agents with good context handling and don’t need top-tier coding or design.
  • Strengths: Great price/perf, agent-friendly.
  • Weaknesses: Not the highest intelligence; weaker at code/design.
  • Modalities: Text, vision | Pricing: $0.30 / $0.50 | Context: 128k

Moonshot

Kimi K2 — Apprentice

  • Use when: You need cheap, fast coding (Groq-hosted speed) and decent agent behavior without long-context demands.
  • Strengths: Good coding, cheap, very fast via Groq providers.
  • Weaknesses: Weaker reasoning, hallucination risk, no vision, long context struggles.
  • Modalities: Text | Pricing: $0.15 / $2.50 | Context: 128k

DeepSeek

DeepSeek R1 — Professor

  • Use when: You want a reasoning-flavored model for distillation experiments, creative drafts, or quirky code.
  • Strengths: Synthetic reasoning data, creative, decent coding.
  • Weaknesses: Bad tool calling, no vision.
  • Modalities: Text | Pricing: $3 / $5 | Context: 164k | Open-source

DeepSeek V3 — Actor

  • Use when: You want a cheap chat model for general use and light code, but you value cost over precision with tools.
  • Strengths: Cheap chat, decent code.
  • Weaknesses: Not very smart, weak tools, no vision. GLM 4.5 often wins.
  • Modalities: Text | Pricing: $0.25 / $0.85 | Context: 164k | Open-source

Z.ai

GLM 4.5 Air — Tinker

  • Use when: You need a very cheap model for local agents or batch tasks where mistakes are tolerable.
  • Strengths: Very cheap; OK tools.
  • Weaknesses: Pretty dumb vs. main GLM 4.5; no vision.
  • Modalities: Text | Pricing: $0.50 / $2 | Context: 128k | Open-source

Alibaba

Qwen3 Coder Flash — Yeoman

  • Use when: You want local-friendly coding on small hardware or cheap API coding with tight loops.
  • Strengths: Good code, cheap, accessible for local setups.
  • Weaknesses: Not smart; no vision; limited API availability for now.
  • Modalities: Text | Pricing: $0.20 / $0.80 | Context: 1M | Open-source

Qwen3 — Janus

  • Use when: You need structured chat with niche formats and long context; not for serious coding.
  • Strengths: Follows custom schemas well, cheap, long context.
  • Weaknesses: Not a coding model; confusing naming and dual variants.
  • Modalities: Text | Pricing: $0.15 / $0.80 | Context: 256k | Open-source

Google Open Models

Gemma 3 — Watchman

  • Use when: You want local multimodal experiments with decent chat and light code on smaller GPUs.
  • Strengths: Local-friendly, multimodal, design-aware for its size.
  • Weaknesses: Not very smart overall.
  • Modalities: Text, vision | Pricing: $0.09 / $0.17 | Context: 128k | Open-source

Choosing Quickly: A Practical Playbook

  • General coding agent: Start with o3. If you need polished code or writing, switch to Claude Sonnet 4. If price bites, GLM 4.5.
  • Frontend design + code handoff: Qwen3 Coder for speed. If it needs taste, check Sonnet 4.
  • Huge documents or multimodal context: Gemini 2.5 Pro. For cheaper volume media processing, Gemini 2.5 Flash.
  • Voice agent with tools and pictures: GPT-4o.
  • Long-running, math-heavy agents: Grok 4. For cheaper agent loops, Grok 3 Mini or o4-mini.
  • Structured extraction at scale: GPT-4.1 Nano. If you need higher accuracy without cost spikes, GPT-4.1 Mini.
  • Open-source tinkering and local: GLM 4.5, Qwen3 Coder Flash, Gemma 3. Distillation experiments: DeepSeek R1.
  • Nuclear-option deep analysis: o3-pro, but only if you’ve proven nothing else is good enough.
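
The playbook above can be sketched as a small lookup table. This is only one way to encode it: the task keys, the constraint names (`polish`, `budget`, `accuracy`), and the `choose` helper are my own hypothetical labels, and the model choices are copied from the bullets.

```python
# Each task maps to a default model plus optional alternates keyed by constraint.
PLAYBOOK = {
    "general_coding_agent": {
        "default": "o3",
        "polish": "Claude Sonnet 4",
        "budget": "GLM 4.5",
    },
    "frontend": {"default": "Qwen3 Coder", "polish": "Claude Sonnet 4"},
    "huge_context": {"default": "Gemini 2.5 Pro", "budget": "Gemini 2.5 Flash"},
    "voice_agent": {"default": "GPT-4o"},
    # The playbook also lists o4-mini as a cheaper option for agent loops.
    "math_agent": {"default": "Grok 4", "budget": "Grok 3 Mini"},
    "extraction": {"default": "GPT-4.1 Nano", "accuracy": "GPT-4.1 Mini"},
    "deep_analysis": {"default": "o3-pro"},
}


def choose(task: str, constraint: str = "default") -> str:
    """Return the playbook pick, falling back to the task's default model."""
    options = PLAYBOOK[task]
    return options.get(constraint, options["default"])


print(choose("general_coding_agent"))
print(choose("general_coding_agent", "budget"))
print(choose("voice_agent", "budget"))  # no budget alternate, so the default is used
```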

Cost, Context, and Modality at a Glance

Model | Type | Modalities | Context | $ In / Out (per 1M tokens) | When to Use
o3 | Reasoning | Text, Vision | 200k | $2 / $8 | Daily driver reasoning + tools
Claude Sonnet 4 | Hybrid | Text, Vision | 200k | $3 / $15 | Code, writing, design quality
Gemini 2.5 Pro | Reasoning | Text, Vision, Audio | 1M | $1.25 / $10 | Multimodal long-context
GLM 4.5 | Hybrid | Text | 128k | $0.50 / $2 | Cheap open-source coding
Qwen3 Coder | Non-reasoning | Text | 256k | $2 / $2 | Fast coding + tools
GPT-4.1 | Non-reasoning | Text, Vision | 1M | $2 / $8 | Reliable tools + JSON
GPT-4.1 Mini | Non-reasoning | Text, Vision | 1M | $0.40 / $1.60 | Cheap agents
GPT-4.1 Nano | Non-reasoning | Text, Vision | 1M | $0.10 / $0.40 | Data extraction at scale
GPT-4o | Non-reasoning | Text, Vision, Speech, Img Gen | 128k | $2.50 / $10 | Voice agents, native images
o4-mini | Reasoning | Text, Vision | 200k | $1.10 / $4.40 | Cheap reasoning
o3-pro | Reasoning | Text, Vision | 200k | $20 / $80 | Deep analysis only
Claude Opus 4 | Hybrid | Text, Vision | 200k | $15 / $75 | Premium code + writing
Claude 3.5 Haiku | Non-reasoning | Text, Vision | 200k | $0.80 / $4 | Light Claude tasks
Kimi K2 | Non-reasoning | Text | 128k | $0.15 / $2.50 | Groq-speed coding
Grok 4 | Reasoning | Text | 256k | $3 / $15 | Long agents + math
Grok 3 Mini | Reasoning | Text, Vision | 128k | $0.30 / $0.50 | Cheap agent loops
Gemini 2.5 Flash | Non-reasoning | Text, Vision, Audio | 1M | $0.30 / $2.50 | Volume media processing
DeepSeek R1 | Hybrid | Text | 164k | $3 / $5 | Distillation, creative reasoning
DeepSeek V3 | Non-reasoning | Text | 164k | $0.25 / $0.85 | Cheap chat, light code
GLM 4.5 Air | Non-reasoning | Text | 128k | $0.50 / $2 | Very cheap local agents
Qwen3 Coder Flash | Non-reasoning | Text | 1M | $0.20 / $0.80 | Local-friendly coding
Qwen3 | Non-reasoning | Text | 256k | $0.15 / $0.80 | Schema-heavy chat
Gemma 3 | Non-reasoning | Text, Vision | 128k | $0.09 / $0.17 | Local multimodal
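
To make the "$ In / Out" column concrete, here is a small sketch of the arithmetic for the main five. Prices are per million tokens, copied from the table. The workload (50,000 input tokens, 2,000 output tokens) and the helper names `request_cost` and `rank_by_cost` are assumptions for illustration only.

```python
# USD per 1M tokens as (input, output), copied from the table for the main five.
PRICES = {
    "o3": (2.00, 8.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "GLM 4.5": (0.50, 2.00),
    "Qwen3 Coder": (2.00, 2.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request: tokens times the per-million price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def rank_by_cost(input_tokens: int, output_tokens: int) -> list:
    """Model names sorted from cheapest to most expensive for this workload."""
    return sorted(PRICES, key=lambda m: request_cost(m, input_tokens, output_tokens))


# Hypothetical workload: a 50k-token prompt with a 2k-token answer.
for name in rank_by_cost(50_000, 2_000):
    print(f"{name}: ${request_cost(name, 50_000, 2_000):.4f}")
```

The ranking shifts with the input/output mix: Qwen3 Coder's flat $2 / $2 pricing looks better on output-heavy jobs, while Gemini 2.5 Pro's cheap input favors prompt-heavy ones.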

Notes on Reliability, Naming, and Routing

Two points I keep coming back to:

  • Naming chaos: vendors keep reusing names and fragments. If you want a laugh or a headache, I wrote about the naming mess here: The Clowns of Naming.
  • Model routing is the endgame: most teams will run a router that picks a model based on task, cost ceiling, and accuracy target. For a good primer on why agent workflows make price matter more than you expect, see Cheap AI Tokens, Expensive Tasks.
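
A router along those lines can be sketched in a few lines. This is a toy version under stated assumptions, not how any production router works: the caller supplies a preference order as a stand-in for the accuracy target, and the router returns the first model that satisfies the modality, context, and cost constraints. The `Model` class and `route` function are hypothetical names, and the catalog values are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    price_in: float        # USD per 1M input tokens
    price_out: float       # USD per 1M output tokens
    context: int           # context window in tokens
    modalities: frozenset  # supported input modalities


CATALOG = {
    m.name: m
    for m in [
        Model("o3", 2.00, 8.00, 200_000, frozenset({"text", "vision"})),
        Model("Gemini 2.5 Pro", 1.25, 10.00, 1_000_000,
              frozenset({"text", "vision", "audio"})),
        Model("GLM 4.5", 0.50, 2.00, 128_000, frozenset({"text"})),
        Model("GPT-4.1 Nano", 0.10, 0.40, 1_000_000, frozenset({"text", "vision"})),
    ]
}


def route(preference, input_tokens, output_tokens, needs, max_cost_usd) -> Optional[str]:
    """Return the first preferred model that fits every constraint, else None."""
    for name in preference:
        model = CATALOG[name]
        if not set(needs) <= model.modalities:
            continue  # missing a required modality
        if input_tokens + output_tokens > model.context:
            continue  # request does not fit the context window
        cost = (input_tokens * model.price_in
                + output_tokens * model.price_out) / 1_000_000
        if cost <= max_cost_usd:
            return name
    return None


order = ["o3", "Gemini 2.5 Pro", "GPT-4.1 Nano"]
print(route(order, 300_000, 2_000, {"text"}, max_cost_usd=1.00))
print(route(order, 300_000, 2_000, {"text"}, max_cost_usd=0.10))
```

With a 300k-token input, o3 is skipped for context size in both calls. The cost ceiling then decides between Gemini 2.5 Pro and GPT-4.1 Nano.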

My Defaults

  • Plan or reason: o3, or o4-mini for cheaper runs.
  • Write code with taste: Claude Sonnet 4.
  • Huge multimodal briefs: Gemini 2.5 Pro; for bulk media IO, Gemini 2.5 Flash.
  • Cheap open-source coding: GLM 4.5; for speed loops, Qwen3 Coder.
  • Voice or image-native assistant: GPT-4o.
  • Extraction farm: GPT-4.1 Nano or Mini.
  • Long agents with math: Grok 4; lower-cost alternative: Grok 3 Mini.
  • One-off deep-dive: o3-pro if and only if nothing else is good enough.

Use the five main cards for most work. Pull out the specialists only when the job demands a specific constraint — modality, latency, context size, or price. That’s how you keep results high and bills low.