Gemini 3.1 Flash-Lite: Cost, Speed, and Intelligence

Gemini 3.1 Flash-Lite Preview dropped on March 3, 2026, positioned as a drop-in replacement for low-cost models. It did not land that way. The intent was clear, but the execution left developers with better alternatives in almost every direction.

The model scores a 34 on the Artificial Analysis Intelligence Index and outputs around 389 tokens per second. That speed puts it in the 99th percentile. The intelligence score puts it well behind Gemini 3.1 Pro Preview at 57, and even behind Gemini 3 Flash at 46. So the question is whether speed alone justifies the tradeoff, and for most use cases the answer is no.

Not Pareto Optimal on Cost vs. Intelligence

The clearest problem with this model is that Gemini 3 Flash without reasoning is both cheaper and more intelligent. That means Flash-Lite is not on the Pareto frontier when you plot cost against intelligence. There is a strictly better option available at a lower price with higher capability. That is a tough position to defend as a product.

Google also raised the price compared to prior Flash-Lite models. Its pricing hit a low point with Gemini 2, and the current Flash-Lite costs more than three times what Gemini 2.0 Flash did. For developers who were already watching the cost curve tighten, that trend makes this release harder to recommend.

The chart below shows where Flash-Lite sits relative to other models on the cost versus intelligence axis. The ideal quadrant is top-left: high intelligence, low cost. Nothing lives there right now, but Flash-Lite is notably far from it.

Intelligence vs Cost scatter plot

Reading the chart from left to right (intelligence, cost index):

    gpt-oss-120B                     25     16
    Grok 4.1 Fast                    39     48
    Gemini 3 Flash (no reasoning)    35     64
    DeepSeek V3.2                    42     80
    Gemini 3.1 Flash-Lite            33    100
    MiniMax-M2.5                     42    110
    Gemini 3 Flash                   46    256
    Kimi K2.5 / GLM-5               ~46    higher still
    Gemini 3.1 Pro Preview           57    512

Flash-Lite is dominated on both axes by Gemini 3 Flash without reasoning. That is the whole problem in one data point.
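The dominance claim is easy to verify programmatically. A minimal sketch, using the (intelligence, cost index) pairs read off the chart above, checks which models Pareto-dominate Flash-Lite, meaning they score at least as well on intelligence at no greater cost, with at least one strict improvement:

```python
# Pareto-dominance check over (intelligence, cost index) pairs from the chart.
models = {
    "gpt-oss-120B": (25, 16),
    "Grok 4.1 Fast": (39, 48),
    "Gemini 3 Flash (no reasoning)": (35, 64),
    "DeepSeek V3.2": (42, 80),
    "Gemini 3.1 Flash-Lite": (33, 100),
    "MiniMax-M2.5": (42, 110),
    "Gemini 3 Flash": (46, 256),
    "Gemini 3.1 Pro Preview": (57, 512),
}

def dominated_by(name):
    """Return every model with >= intelligence at <= cost (one strict)."""
    intel, cost = models[name]
    return [
        other
        for other, (i, c) in models.items()
        if other != name and i >= intel and c <= cost and (i > intel or c < cost)
    ]

print(dominated_by("Gemini 3.1 Flash-Lite"))
```

By these chart readings, Flash-Lite is dominated not only by Gemini 3 Flash without reasoning but also by Grok 4.1 Fast and DeepSeek V3.2, which only sharpens the point.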

Where Flash-Lite Actually Sits on Speed

Speed is the real story here. At 389 tokens per second, Flash-Lite is genuinely fast. The models that score higher on intelligence, like Gemini 3.1 Pro Preview and GLM-5, top out around 80 tokens per second. GLM-4.7 on Cerebras hits around 550 tokens per second but lands at roughly the same intelligence tier as Flash-Lite. MiniMax-M2.5 on SambaNova is comparable in speed at around 395 tokens per second with slightly higher intelligence.

The pattern holds across the board: smarter models are slower, faster models are not smarter, and nothing cheap is both fast and smart. Flash-Lite occupies a real position on the speed axis, but that position is not exclusive enough to carry the full value proposition on its own.

Intelligence vs Output Speed scatter plot

One note on the speed chart: GPT-OSS-120B on Cerebras is not plotted because its speed would compress everything else into unreadability. If raw speed with no other constraints is what you need, that model is by far the fastest option available; it simply did not fit on this chart cleanly.

The Multimodal Angle

This is where Flash-Lite earns its keep for a specific set of use cases. It supports text and image input, with audio and video also in scope, along with a 1 million token context window. The models that are faster than it, like GLM-4.7 on Cerebras at 550 tokens per second, are not multimodal. That makes Flash-Lite the fastest multimodal model available at this point.

There are cheaper multimodal options, but if you need multimodal input and low latency together, Flash-Lite is a legitimate choice. That combination is narrow, but it is real. High-volume transcription pipelines, fast image-to-text workflows, or latency-sensitive agentic tasks that need to ingest images are the kinds of applications where this model fits. Outside of that specific scenario, the numbers point elsewhere, and they point there clearly.

It is also a reasoning model with chain-of-thought, which means it generates more tokens by default. That inflates cost and reduces effective throughput compared to a non-reasoning model at the same speed tier. If additional benchmarks surface a no-reasoning or low-reasoning configuration that performs well, the cost-intelligence picture could shift. For now, it stays where it is.
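The throughput penalty from chain-of-thought is worth making concrete. A minimal sketch, using the reported 389 tokens per second but assumed token counts (the 200-token answer and 600 hidden reasoning tokens are illustrative, not measured Flash-Lite figures):

```python
# Hypothetical illustration of how hidden reasoning tokens erode
# effective throughput. Token counts are assumptions, not measurements.
speed_tps = 389          # reported output speed, tokens/second
answer_tokens = 200      # tokens the caller actually wants (assumed)
reasoning_tokens = 600   # hidden chain-of-thought tokens (assumed)

total_tokens = answer_tokens + reasoning_tokens
latency_s = total_tokens / speed_tps          # time to finish the response
effective_tps = answer_tokens / latency_s     # useful-output throughput

print(f"latency: {latency_s:.2f}s, effective speed: {effective_tps:.0f} tok/s")
```

With these assumed numbers, the headline 389 tokens per second collapses to roughly 97 tokens per second of useful output, and every reasoning token is billed. A non-reasoning model at a quarter of the raw speed would match it.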

The Verdict

The release failed at its stated goal. It was meant to be a drop-in replacement for low-cost models, and it is not competitive on that axis. Gemini 3 Flash without reasoning beats it on both cost and intelligence. For most developers, that should be the answer.

The speed and multimodal combination is a real niche, and it is probably the only defensible one here. If your pipeline demands high throughput with image or audio inputs, Flash-Lite is likely your best current option. Outside of that, the numbers are not in its favor.

Google has been raising prices consistently since the Gemini 2 era, and this release continues that pattern without offering enough to justify it. For a broader look at how the current model tier stacks up across labs, the AI Labs LLM Rankings 2026 post covers the full picture. And if you want context on how speed benchmarks can be misleading at the high end, the GPT-5.3-Codex-Spark speed analysis is worth reading alongside this one.


Adam Holter

Founder of Ironwood AI. Writing about AI stuff!