The short version: use the main five for 80% of your work — o3, Claude Sonnet 4, Gemini 2.5 Pro, GLM 4.5, Qwen3 Coder. The remaining 20% goes to task-specific models where price, reasoning depth, modalities, or latency matter more than raw IQ. Below is a practical field guide with when-to-use rules, tradeoffs, and cost notes. I’ve also added quick visuals and a cost/fit matrix.
The Five You’ll Use 80% of the Time
OpenAI o3 — Warrior
- Use when: You need steady reasoning, tool use, and reliable research. It’s my daily driver for analysis and agent planning.
- Strengths: Smart, consistent, tool-friendly, solid for quick research. 200k context.
- Weaknesses: Mid at long-form writing and frontend code polish.
- Modalities: Text, vision
- Pricing: $2 in / $8 out per million tokens
Claude Sonnet 4 — Artisan
- Use when: You want high-quality coding, strong writing and design sense, or clean tool/agent flows with fewer retries.
- Strengths: Great code and prose, design-aware, strong for agents.
- Weaknesses: Weaker on heavy reasoning. Pricey.
- Modalities: Text, vision
- Pricing: $3 in / $15 out
Gemini 2.5 Pro — Sorcerer
- Use when: You need multimodal with text/vision/audio and very long context, or creative ideation and UI/brand explorations.
- Strengths: Smart, creative, long context up to 1M, good code and design.
- Weaknesses: Reliability and steerability can slip; weak on tools.
- Modalities: Text, vision, audio
- Pricing: $1.25 in / $10 out
GLM 4.5 (Z.ai) — Apprentice
- Use when: You want an open model that’s cheap and competent at code, writing, design, and basic agent tasks.
- Strengths: Good code/writing/design for the price; open; cheap.
- Weaknesses: Not a deep reasoner, no vision.
- Modalities: Text
- Pricing: $0.50 in / $2 out
Qwen3 Coder — Apprentice
- Use when: You want speed-first coding with strong tool calling and rapid frontend iteration, especially on Cerebras.
- Strengths: Fast coding (roughly 2,000 tokens/s on Cerebras), strong tool calls, good frontend patterns.
- Weaknesses: Weak creativity/reasoning; no vision.
- Modalities: Text
- Pricing: $2 in / $2 out
If you only remember one rule: start with o3 or Sonnet 4 for most dev tasks. Switch to Gemini 2.5 Pro for huge multimodal context, GLM 4.5 for cheap open-source projects, and Qwen3 Coder when you want speed-first coding with tools.
Figure: models plotted by cost (left to right) and capability for general dev tasks (top to bottom). Use the main five by default; reach for specialists only when the constraints demand it.
The 20%: Specialized Models and When to Use Each
OpenAI
GPT-4.1 — Jack of All Trades
- Use when: You need reliable tool calling, structured output, long context, and you want a steady generalist that follows instructions cleanly.
- Strengths: Reliable tools, schema adherence, long context.
- Weaknesses: Not best in any single category.
- Modalities: Text, vision | Pricing: $2 in / $8 out | Context: 1M
GPT-4.1 Mini — Squire
- Use when: You want cheap agents at scale, such as customer support chat or lightweight RAG, where correct formatting matters more than depth.
- Strengths: Good tools, long context, cheap.
- Weaknesses: Lower intelligence ceiling.
- Modalities: Text, vision | Pricing: $0.40 / $1.60 | Context: 1M
GPT-4.1 Nano — Scribe
- Use when: You need a data extraction mule with strict JSON and high throughput.
- Strengths: Very cheap, reliable structured output.
- Weaknesses: Minimal reasoning.
- Modalities: Text, vision | Pricing: $0.10 / $0.40 | Context: 1M
GPT-4o — Bard
- Use when: You’re building voice agents with tools, need native image generation, or want a chatty assistant that handles memory nicely.
- Strengths: Voice mode, image generation, personalization.
- Weaknesses: Not very smart for complex reasoning.
- Modalities: Text, vision, speech, image generation | Pricing: $2.50 / $10 | Context: 128k
o4-mini — Tactician
- Use when: You want affordable, reliable reasoning for medium-complex agents that plan and reflect without blowing your budget.
- Strengths: Solid reasoning, fast and cheap compared to frontier models.
- Weaknesses: World knowledge can lag; weak writing/creativity.
- Modalities: Text, vision | Pricing: $1.10 / $4.40 | Context: 200k
o3-pro — Colossus
- Use when: You need the nuclear option for deep analysis, complex decisions, or jumbo context digestion. For 99.9% of tasks, it’s overkill.
- Strengths: Tons of thinking, context devourer.
- Weaknesses: Slow and expensive; bad for writing/design.
- Modalities: Text, vision | Pricing: $20 / $80 | Context: 200k
Anthropic
Claude Opus 4 — Goldsmith
- Use when: You want Sonnet’s strengths with more innate intelligence for code, writing, and design, and you can afford it.
- Strengths: Great code, writing, design, tool use.
- Weaknesses: Very expensive; still not a math-first tank.
- Modalities: Text, vision | Pricing: $15 / $75 | Context: 200k
Claude 3.5 Haiku — Page
- Use when: You want Claude vibes at a lower price for light coding and summarization, but GLM 4.5 often beats it on value.
- Strengths: Cheap-ish, lightweight Claude flavor.
- Weaknesses: Not very smart; value often loses to GLM 4.5.
- Modalities: Text, vision | Pricing: $0.80 / $4 | Context: 200k
Google
Gemini 2.5 Flash — Archivist
- Use when: You need to process tons of audio or video cheaply with long context, where you care about throughput more than frontier intelligence.
- Strengths: Long context, cheap for media processing at scale.
- Weaknesses: Can get confused under heavy data; not as smart as top models.
- Modalities: Text, vision, audio | Pricing: $0.30 / $2.50 | Context: 1M
xAI
Grok 4 — Grandmaster
- Use when: You’re running long agents that need math and code coherence with high depth. Prompt differently; it thinks a lot.
- Strengths: Very smart, coherent over long runs, math/code strength.
- Weaknesses: Slow, expensive in practice, overthinks.
- Modalities: Text | Pricing: $3 / $15 | Context: 256k
Grok 3 Mini — Marshal
- Use when: You want strong price/perf for agents with good context handling and don’t need top-tier coding or design.
- Strengths: Great price/perf, agent-friendly.
- Weaknesses: Not the highest intelligence; weaker at code/design.
- Modalities: Text, vision | Pricing: $0.30 / $0.50 | Context: 128k
Moonshot
Kimi K2 — Apprentice
- Use when: You need cheap coding at Groq-level speed and decent agent behavior without long-context demands.
- Strengths: Good coding, cheap, very fast when served via Groq.
- Weaknesses: Weaker reasoning, hallucination risk, no vision, struggles with long context.
- Modalities: Text | Pricing: $0.15 / $2.50 | Context: 128k
DeepSeek
DeepSeek R1 — Professor
- Use when: You want a reasoning-flavored model for distillation experiments, creative drafts, or quirky code.
- Strengths: Synthetic reasoning data, creative, decent coding.
- Weaknesses: Bad tool calling, no vision.
- Modalities: Text | Pricing: $3 / $5 | Context: 164k | Open-source
DeepSeek V3 — Actor
- Use when: You want a cheap chat model for general use and light code, and you value cost over precise tool use.
- Strengths: Cheap chat, decent code.
- Weaknesses: Not very smart, weak tools, no vision. GLM 4.5 often wins.
- Modalities: Text | Pricing: $0.25 / $0.85 | Context: 164k | Open-source
Z.ai
GLM 4.5 Air — Tinker
- Use when: You need a very cheap model for local agents or batch tasks where mistakes are tolerable.
- Strengths: Very cheap; OK tools.
- Weaknesses: Pretty dumb vs. main GLM 4.5; no vision.
- Modalities: Text | Pricing: $0.20 / $1.10 | Context: 128k | Open-source
Alibaba
Qwen3 Coder Flash — Yeoman
- Use when: You want local-friendly coding on small hardware or cheap API coding with tight loops.
- Strengths: Good code, cheap, accessible for local setups.
- Weaknesses: Not smart; no vision; limited API availability for now.
- Modalities: Text | Pricing: $0.20 / $0.80 | Context: 1M | Open-source
Qwen3 — Janus
- Use when: You need structured chat with niche formats and long context; not for serious coding.
- Strengths: Follows custom schemas well, cheap, long context.
- Weaknesses: Not a coding model; confusing naming and dual variants.
- Modalities: Text | Pricing: $0.15 / $0.80 | Context: 256k | Open-source
Google Open Models
Gemma 3 — Watchman
- Use when: You want local multimodal experiments with decent chat and light code on smaller GPUs.
- Strengths: Local-friendly, multimodal, design-aware for its size.
- Weaknesses: Not very smart overall.
- Modalities: Text, vision | Pricing: $0.09 / $0.17 | Context: 128k | Open-source
Choosing Quickly: A Practical Playbook
- General coding agent: Start with o3. If you need polished code or writing, switch to Claude Sonnet 4. If price bites, GLM 4.5.
- Frontend design + code handoff: Qwen3 Coder for speed. If it needs taste, check Sonnet 4.
- Huge documents or multimodal context: Gemini 2.5 Pro. For cheaper volume media processing, Gemini 2.5 Flash.
- Voice agent with tools and pictures: GPT-4o.
- Long-running, math-heavy agents: Grok 4. For cheaper agent loops, Grok 3 Mini or o4-mini.
- Structured extraction at scale: GPT-4.1 Nano. If you need higher accuracy without cost spikes, GPT-4.1 Mini.
- Open-source tinkering and local: GLM 4.5, Qwen3 Coder Flash, Gemma 3. Distillation experiments: DeepSeek R1.
- Nuclear-option deep analysis: o3-pro, but only if you’ve proven nothing else is good enough.
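The playbook above can be read as a lookup table. Here is a minimal Python sketch of that idea. The task keys and the `pick` helper are hypothetical names made up for illustration, and the model names are plain labels, not API identifiers. Each list keeps the order in which the playbook mentions the models.

```python
# Illustrative lookup built from the playbook bullets.
# Task keys are hypothetical labels; model names are display names, not API ids.
PLAYBOOK = {
    "general_coding_agent": ["o3", "Claude Sonnet 4", "GLM 4.5"],
    "frontend_handoff": ["Qwen3 Coder", "Claude Sonnet 4"],
    "long_multimodal_context": ["Gemini 2.5 Pro", "Gemini 2.5 Flash"],
    "voice_agent": ["GPT-4o"],
    "math_heavy_agent": ["Grok 4", "Grok 3 Mini", "o4-mini"],
    "structured_extraction": ["GPT-4.1 Nano", "GPT-4.1 Mini"],
    "local_open_source": ["GLM 4.5", "Qwen3 Coder Flash", "Gemma 3"],
    "distillation": ["DeepSeek R1"],
    "deep_analysis": ["o3-pro"],
}


def pick(task: str, skip: frozenset = frozenset()) -> str:
    """Return the first playbook model for `task` that is not in `skip`."""
    for model in PLAYBOOK[task]:
        if model not in skip:
            return model
    raise LookupError(f"no remaining model for task {task!r}")


print(pick("general_coding_agent"))                     # o3
print(pick("general_coding_agent", frozenset({"o3"})))  # Claude Sonnet 4
```

The `skip` argument stands in for "I already tried that one and it was not good enough", which is how the playbook bullets are phrased.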
Cost, Context, and Modality at a Glance
| Model | Type | Modalities | Context | $ In / Out | When to Use |
|---|---|---|---|---|---|
| o3 | Reasoning | Text, Vision | 200k | $2 / $8 | Daily driver reasoning + tools |
| Claude Sonnet 4 | Hybrid | Text, Vision | 200k | $3 / $15 | Code, writing, design quality |
| Gemini 2.5 Pro | Reasoning | Text, Vision, Audio | 1M | $1.25 / $10 | Multimodal long-context |
| GLM 4.5 | Hybrid | Text | 128k | $0.50 / $2 | Cheap open-source coding |
| Qwen3 Coder | Non-reasoning | Text | 256k | $2 / $2 | Fast coding + tools |
| GPT-4.1 | Non-reasoning | Text, Vision | 1M | $2 / $8 | Reliable tools + JSON |
| GPT-4.1 Mini | Non-reasoning | Text, Vision | 1M | $0.40 / $1.60 | Cheap agents |
| GPT-4.1 Nano | Non-reasoning | Text, Vision | 1M | $0.10 / $0.40 | Data extraction scale |
| GPT-4o | Non-reasoning | Text, Vision, Speech, Img Gen | 128k | $2.50 / $10 | Voice agents, native images |
| o4-mini | Reasoning | Text, Vision | 200k | $1.10 / $4.40 | Cheap reasoning |
| o3-pro | Reasoning | Text, Vision | 200k | $20 / $80 | Deep analysis only |
| Claude Opus 4 | Hybrid | Text, Vision | 200k | $15 / $75 | Premium code+writing |
| Claude 3.5 Haiku | Non-reasoning | Text, Vision | 200k | $0.80 / $4 | Light Claude tasks |
| Kimi K2 | Non-reasoning | Text | 128k | $0.15 / $2.50 | Groq-speed coding |
| Grok 4 | Reasoning | Text | 256k | $3 / $15 | Long agents + math |
| Grok 3 Mini | Reasoning | Text, Vision | 128k | $0.30 / $0.50 | Cheap agent loops |
| Gemini 2.5 Flash | Non-reasoning | Text, Vision, Audio | 1M | $0.30 / $2.50 | Volume media processing |
| DeepSeek R1 | Hybrid | Text | 164k | $3 / $5 | Distillation, creative reasoning |
| DeepSeek V3 | Non-reasoning | Text | 164k | $0.25 / $0.85 | Cheap chat, light code |
| GLM 4.5 Air | Non-reasoning | Text | 128k | $0.20 / $1.10 | Very cheap local agents |
| Qwen3 Coder Flash | Non-reasoning | Text | 1M | $0.20 / $0.80 | Local-friendly coding |
| Qwen3 | Non-reasoning | Text | 256k | $0.15 / $0.80 | Schema-heavy chat |
| Gemma 3 | Non-reasoning | Text, Vision | 128k | $0.09 / $0.17 | Local multimodal |
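To turn the price column into a per-call number, multiply token counts by the per-million prices. A small Python sketch follows, using the main five rows from the table. The 50k-input / 5k-output request is an assumed example workload, not a measurement, and `call_cost` is a hypothetical helper name.

```python
# Prices copied from the table, in USD per million tokens: (input, output).
PRICES = {
    "o3": (2.00, 8.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "GLM 4.5": (0.50, 2.00),
    "Qwen3 Coder": (2.00, 2.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at list price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Assumed workload: 50k tokens in, 5k tokens out.
for model in PRICES:
    print(f"{model}: ${call_cost(model, 50_000, 5_000):.4f}")
# o3: $0.1400
# Claude Sonnet 4: $0.2250
# Gemini 2.5 Pro: $0.1125
# GLM 4.5: $0.0350
# Qwen3 Coder: $0.1100
```

On this input-heavy workload Qwen3 Coder lands close to Gemini 2.5 Pro despite its flat $2 / $2 pricing, so it pays to run the numbers for your own input/output ratio.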
Notes on Reliability, Naming, and Routing
Two points I keep coming back to:
- Naming chaos: vendors keep reusing names and fragments. If you want a laugh or a headache, I wrote about the naming mess here: The Clowns of Naming.
- Model routing is the endgame: most teams will run a router that picks a model based on task, cost ceiling, and accuracy target. For a good primer on why agent workflows make price matter more than you expect, see Cheap AI Tokens, Expensive Tasks.
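As a rough illustration of the routing idea, here is a minimal Python sketch that walks a preference-ordered list and returns the first model that fits the required modalities, the context window, and a per-call cost ceiling. The `Model` class, `CATALOG`, and `route` are hypothetical names. An accuracy target would need per-task eval scores, which this guide does not provide, so the preference order stands in for it here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    modalities: frozenset  # e.g. frozenset({"text", "vision"})
    context: int           # max context window, in tokens
    price_in: float        # USD per million input tokens
    price_out: float       # USD per million output tokens


# A few rows copied from the table in this guide; extend as needed.
CATALOG = {
    m.name: m
    for m in [
        Model("o3", frozenset({"text", "vision"}), 200_000, 2.00, 8.00),
        Model("Claude Sonnet 4", frozenset({"text", "vision"}), 200_000, 3.00, 15.00),
        Model("Gemini 2.5 Pro", frozenset({"text", "vision", "audio"}), 1_000_000, 1.25, 10.00),
        Model("GLM 4.5", frozenset({"text"}), 128_000, 0.50, 2.00),
    ]
}


def estimate_cost(m: Model, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at list price."""
    return (input_tokens * m.price_in + output_tokens * m.price_out) / 1_000_000


def route(preferred, needs, input_tokens, output_tokens, max_cost) -> Optional[str]:
    """Return the first model in `preferred` that satisfies every constraint."""
    for name in preferred:
        m = CATALOG[name]
        if not needs <= m.modalities:
            continue  # missing a required modality
        if input_tokens + output_tokens > m.context:
            continue  # request would not fit in the context window
        if estimate_cost(m, input_tokens, output_tokens) > max_cost:
            continue  # over the per-call cost ceiling
        return name
    return None  # nothing fits; the caller decides what to relax


# With a $0.10 ceiling on a 50k-in / 5k-out call, o3 and Sonnet 4 are too expensive.
print(route(["o3", "Claude Sonnet 4", "GLM 4.5"], {"text"}, 50_000, 5_000, 0.10))
# GLM 4.5
```

Returning `None` instead of silently picking the cheapest model keeps the decision to relax a constraint explicit.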
My Defaults
- Plan or reason: o3, or o4-mini for cheaper runs.
- Write code with taste: Claude Sonnet 4.
- Huge multimodal briefs: Gemini 2.5 Pro; for bulk media IO, Gemini 2.5 Flash.
- Cheap open-source coding: GLM 4.5; for speed loops, Qwen3 Coder.
- Voice or image-native assistant: GPT-4o.
- Extraction farm: GPT-4.1 Nano or Mini.
- Long agents with math: Grok 4; lower-cost variant: Grok 3 Mini.
- One-off deep-dive: o3-pro if and only if nothing else is good enough.
Use the five main cards for most work. Pull out the specialists only when the job demands a specific constraint — modality, latency, context size, or price. That’s how you keep results high and bills low.