
Cerebras opens a free 1M tokens per day inference tier and claims ~20x faster than NVIDIA: real benchmarks, model limits, and why ui2 matters

Cerebras just made inference cheap to try and fast to ship. The company opened its Inference API with a free tier of 1 million tokens per day and is publishing head‑to‑head numbers that claim up to 20x faster throughput than NVIDIA on key large models. If you build anything latency sensitive or token hungry, this matters. Here’s what’s in the offer, the real speed numbers, where it helps, what I’d test first, and how a unified intent layer like Evan Zhou’s ui2 can sit on top to route work across providers.

What Cerebras is offering, in plain terms

  • Free tier: 1,000,000 tokens per day to prototype and ship early without paying. No waitlist.
  • Speed: vendor benchmarks show multi‑thousand tokens per second on big models and near‑instant responses at 70B scale.
  • Models: Llama 4, Llama 3.1 8B and 70B, Qwen3 32B and 235B Instruct, and more in rotation. Frontier‑scale models are on the roadmap, including workloads at or above GPT‑OSS‑120B.
  • Context: up to 64K tokens on the free tier for Qwen3 235B Instruct and up to 131K for paid. That covers large codebases and long documents.
  • Integration: available through OpenRouter and Hugging Face, so a lot of existing tooling will just work.
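Because the availability is through OpenRouter and Hugging Face, the wire format is the familiar OpenAI‑style chat completions request. A minimal sketch of building one by hand, with a hypothetical base URL and model ID (check the provider docs for the real values before using them):

```python
import json

# Hypothetical endpoint -- verify the exact base URL in the provider docs.
CEREBRAS_BASE_URL = "https://api.cerebras.ai/v1"

def build_chat_request(api_key: str, model: str, prompt: str,
                       max_tokens: int = 512) -> tuple[str, dict, bytes]:
    """Build an OpenAI-style chat completions request for an
    OpenAI-compatible endpoint. Returns (url, headers, body)."""
    url = f"{CEREBRAS_BASE_URL}/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,  # model ID here is illustrative, not confirmed
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,  # stream tokens so you can measure time to first token
    }).encode()
    return url, headers, body

url, headers, body = build_chat_request("sk-example", "llama3.1-70b", "Hello")
```

Any HTTP client that speaks this shape, or an existing OpenAI‑compatible SDK pointed at the base URL, should work without new plumbing.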

The headline speeds

These are the numbers Cerebras is publishing right now:

  • Llama 4 Scout: ~2,600 tokens per second.
  • Llama 3.1 8B: ~1,800 tokens per second, which Cerebras calls 2.4x faster than Groq in their test.
  • Llama 3.1 70B: ~450 tokens per second. That feels instant at 70B scale.
  • Qwen3 235B Instruct: ~1,400 tokens per second, with 64K context on free and 131K on paid.
  • Llama 4 Maverick 400B: ~2,522 tokens per second vs NVIDIA Blackwell at ~1,038 in their cited run.
Chart: Cerebras TPS across the listed models. Vendor numbers; run your own checks.

Chart: Llama 4 Maverick 400B tokens per second, Cerebras vs NVIDIA Blackwell. Source: vendor benchmarks.

Why it is fast

The speed story is rooted in the hardware. Cerebras runs inference on its third‑generation Wafer‑Scale Engine, WSE‑3. Instead of stitching lots of small chips together with links that add latency, WSE puts a vast amount of compute and memory on one giant die. That cuts inter‑chip traffic and lets them stream tokens with low stalls. They also run native 16‑bit weights to preserve accuracy while packing more throughput, and they claim about one‑third the power draw compared to NVIDIA DGX‑class setups for these workloads. The company added six data centers across North America and Europe to keep capacity close to users.

Chart: relative power use for the same workload. Vendor claim: Cerebras at one‑third the power.

Where this helps today

  • Interactive coding and agents: High TPS means your agent loop can plan, call tools, and stream responses without feeling sluggish.
  • Long context use: 64K on free and up to 131K on paid with Qwen3 235B Instruct covers spec docs, multi‑file repositories, and discovery tasks.
  • Speech, video, and multimodal: Low‑latency steps make live pipelines workable. For a view on low‑latency creative pipelines, I wrote about it in Krea’s realtime Sculpt‑ad‑Video and Google Veo 3.
  • OSS frontier models: Support for GPT‑OSS‑120B‑class models means open weights at scale can be practical to deploy for production inference, not just lab demos.

Costs, access, and scale

Free matters because it reduces risk for teams who want to move fast. One million tokens per day is enough for serious prototyping, small internal tools, or a pilot with real users. Cerebras also says cost per token is up to 70% lower than leading cloud‑hosted LLMs like GPT‑4.1. If that holds in your usage, you can keep more interactions on a large model without constantly worrying about spend.

Chart: illustrative cost‑per‑token ratio if the 70% claim holds. Run your own pricing math.

Benchmarks vs NVIDIA and Groq

The vendor head‑to‑heads to pay attention to:

  • Llama 4 Maverick 400B: 2,522 tokens per second on Cerebras vs 1,038 on NVIDIA Blackwell in the cited run.
  • Llama 3.1 8B: 1,800 tokens per second, with a stated 2.4x edge over Groq on their setup.
  • Llama 3.1 70B: 450 tokens per second, which makes 70B responses feel as snappy as a small model in chat UX.

Take any vendor result as a starting point, not the finish line. I want to see the same prompts run end‑to‑end with streaming latency, time to first token, and sustained throughput under load. If you do dependency‑heavy prompts or tool calls, measure the whole loop, not just the raw decode speed. I’ve written before about why model output can degrade with ambiguous prompts and long chains of thought in LLMs as a Lossy Encyclopedia. That still applies here. Speed does not fix unclear instructions or weak retrieval.
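Timing the whole loop is a few lines of code. A minimal harness, run here against a simulated token stream so the logic is easy to verify (swap in your real streaming iterator to measure a live endpoint):

```python
import time
from typing import Iterable

def measure_stream(token_stream: Iterable[str]) -> dict:
    """Consume a token stream and report time to first token (TTFT)
    and sustained decode throughput in tokens per second."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _tok in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # latency of the very first token
        count += 1
    total = time.perf_counter() - start
    decode_time = total - (ttft or 0.0)  # time spent after the first token
    tps = (count - 1) / decode_time if decode_time > 0 and count > 1 else 0.0
    return {"ttft_s": ttft, "tokens": count, "tps": tps}

# Simulated stream: ~50 ms to first token, then a steady trickle.
def fake_stream():
    time.sleep(0.05)
    for _ in range(200):
        time.sleep(0.001)
        yield "tok"

stats = measure_stream(fake_stream())
```

Note that TPS here deliberately excludes TTFT; reporting them separately is what lets you tell a fast decoder with slow queuing apart from the reverse.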

Model limits and GPT‑OSS‑120B

On model limits, the notable one is context:

  • Qwen3 235B Instruct context window: 64K on free and up to 131K on paid.
  • Other models are in the standard short‑to‑mid range depending on the checkpoint.

For GPT‑OSS‑120B‑class workloads, the interesting bit is whether you get both speed and stability. Multi‑hundred‑B parameter models look great in headline throughput, but you still need to check:

  • Token quality on long answers.
  • Stability under batch load and concurrent connections.
  • Latency tails for the slowest 1% of requests.
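Checking that last bullet is easy once you log per‑request timings. A quick nearest‑rank percentile sketch, with made‑up latency samples:

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at or below which pct% of
    samples fall. Good enough for quick tail-latency checks."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Clamp the rank into valid index range.
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

# 100 illustrative request latencies in seconds: mostly fast, two slow outliers.
latencies = [0.2] * 98 + [3.0] * 2
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

The median looks great here while the p99 is fifteen times worse, which is exactly the pattern a throughput headline hides.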
Chart: context windows for Qwen3 235B Instruct, 64K free and 131K paid. Enough for large repos and long documents.

ui2: a better way to route intent across providers

Developers are converging on a simple idea: keep user intent separate from model and provider wiring. Evan Zhou’s unified intent interface, ui2, pushes this pattern. You define an intent once, then route it to the provider that best fits current constraints like speed, cost, or context length. With Cerebras now in OpenRouter and Hugging Face, this matters. You can:

  • Send speed‑critical tasks to Cerebras when you need high TPS and low power draw.
  • Send long context reads to the Qwen3 235B Instruct slot when the document is big.
  • Fail over to another provider if a region is at capacity.

The point is not new plumbing. It is fewer code paths to maintain and quicker iteration. A unified intent layer makes your app less tied to one vendor’s quirks and lets you swap in faster backends like Cerebras when it actually helps.
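ui2's actual API is not reproduced here, but the pattern itself is small. A generic sketch of intent‑based routing, with illustrative slot names and context limits that are assumptions, not ui2's real configuration:

```python
from dataclasses import dataclass

@dataclass
class Intent:
    name: str
    context_tokens: int      # how much context the request carries
    latency_sensitive: bool  # does the UX need fast streaming?

# Hypothetical provider table; IDs and limits are illustrative only.
PROVIDERS = [
    {"id": "cerebras/llama3.1-70b", "max_context": 8_000,   "fast": True},
    {"id": "cerebras/qwen3-235b",   "max_context": 64_000,  "fast": True},
    {"id": "fallback/general",      "max_context": 128_000, "fast": False},
]

def route(intent: Intent) -> str:
    """Pick the first provider that fits the context size, preferring
    fast slots when the intent is latency sensitive."""
    fits = [p for p in PROVIDERS if p["max_context"] >= intent.context_tokens]
    if not fits:
        raise ValueError("no provider can hold this context")
    if intent.latency_sensitive:
        fast = [p for p in fits if p["fast"]]
        if fast:
            fits = fast
    return fits[0]["id"]

chat = Intent("support_chat", context_tokens=2_000, latency_sensitive=True)
long_read = Intent("repo_review", context_tokens=50_000, latency_sensitive=False)
```

Failover fits the same shape: drop an unavailable backend from the table and the same intents re‑route without touching app code.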

Operational notes for teams that care about scale

  • Capacity and regions: Cerebras added six data centers across North America and Europe. If your users are split across regions, test RTT and streaming jitter per region.
  • SLAs and versioning: confirm uptime commitments and whether model IDs remain stable across upgrades. Reproducibility matters for audits and regression testing.
  • Power and facilities: if you run private clusters or need a green accounting trail, the one‑third power claim is meaningful for total cost of ownership.

What I would test first

  • Time to first token vs tokens per second: measure both. A stream can feel instant when TTFT is low even if sustained TPS is average; perceived speed depends on each.
  • Agent loop depth: run a 5 to 10 step tool‑using agent and measure wall‑clock time, not just model time. Faster decode can expose bottlenecks elsewhere.
  • Long context reading: drop a 50K token codebase or doc set into Qwen3 235B Instruct and evaluate recall fidelity across sections.
  • Batch concurrency: hit the free tier with sustained concurrent requests and look for tail latency spikes or throttling.
  • Cost run‑rate: estimate daily token burn at your app’s average input and output lengths, then compare to your current stack.
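The run‑rate math in that last check is simple enough to script. An illustrative sketch with made‑up traffic numbers and hypothetical per‑million prices (plug in your real rate cards):

```python
def daily_token_burn(users: int, turns_per_user: int,
                     avg_in: int, avg_out: int) -> int:
    """Estimate daily token usage from traffic shape."""
    return users * turns_per_user * (avg_in + avg_out)

def daily_cost(tokens: int, usd_per_million: float) -> float:
    """Convert a token count into spend at a per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million

# Illustrative traffic: 500 users, 8 turns each, 400 input + 300 output tokens.
burn = daily_token_burn(500, 8, 400, 300)

# Hypothetical prices, not real rate cards; the second line is what the
# claimed ~70% lower cost per token would look like.
current = daily_cost(burn, 5.00)
candidate = daily_cost(burn, 1.50)
```

At these example numbers the app burns 2.8M tokens per day, already past the 1M free tier, which is exactly the kind of thing this estimate surfaces before launch.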

Practical build recipes

  • Fast chat and support: route chat to Llama 3.1 70B on Cerebras for quick answers. If a user uploads a long PDF or code bundle, switch the same intent to Qwen3 235B Instruct for the follow‑up.
  • Agentic coding: keep the planner and tool router on a small model, but push code synthesis and refactor steps to a high‑TPS slot so the loop doesn’t stall. Cache intermediate context to avoid paying to re‑read the same files.
  • Media pipelines: if you’re orchestrating stages like transcript, outline, and script, the slowest stage sets your throughput. Use a high‑TPS inferencer for the language stages and then see if your video or image steps need revision. For more on real‑time creative steps, see Krea’s realtime Sculpt‑ad‑Video and the production view in Veo 3.

Accuracy and precision notes

Cerebras runs native 16‑bit weights. That tends to be the right trade for inference. I would still run task‑specific evals for your domain. If you care about code generation with strict linting, logic puzzles, or compliance answer formats, bring your own test suite. The speed win is only useful if the model keeps quality on your prompts.

Risk checks and unknowns

  • Queue behavior on the free tier: do bursts get smoothed or rejected? Are there hidden per‑minute limits?
  • Cold starts: test morning traffic and weekend spikes. Watch for time‑to‑first‑token drift at peak.
  • Model churn: how often do checkpoints rotate, and do you get stable model IDs for reproducibility?
  • Provider mesh complexity: if you adopt ui2 and multiple providers, add monitoring that tags each request with the backend so you can correlate quality and latency by route.

How this fits in the builder kit

For many apps, the choice is not one model. It is a matrix of speed, cost, and context. I would pair a unified intent layer like ui2 with a provider mesh that includes Cerebras for high throughput work, plus a second provider tuned for reasoning tasks where your evals say it does better. If you care about on‑device or edge, see Apple FastVLM and MobileCLIP2 for a different angle on latency. If you are building creative pipelines, compare Cerebras speed claims with your current video and image steps from posts like Veo 3 or Lucy‑14B on Fal.ai. For content quality and prompt structure pitfalls, see LLMs as a Lossy Encyclopedia.

Bottom line

Free 1M tokens per day makes it easy to try. The speed numbers are strong, especially on large models where many stacks still crawl. If your app depends on sub‑second interactivity or long context, put Cerebras on your short list and measure it on your prompts. Then add an intent router like ui2 so you can send the right jobs to the right backend without wiring everything twice.