Header image: plain black-on-white text reading “Open source AI is not for running locally”.

2025 Open Models Year in Review: DeepSeek R1, GLM 4.6, and the New Tier List

At the start of 2025, open-weight models were still a tradeoff. You picked them for privacy, cost control, or fine-tuning. If you just wanted the safest general-purpose results, you usually reached for a closed model and moved on.

By the end of 2025, that framing is outdated.

Open models are now good enough that a lot of teams can treat them as a normal option, not a backup plan. That does not mean they win every head-to-head comparison. Closed models still tend to feel more reliable across messy prompts, weird edge cases, and broad world knowledge. But 2025 is the year where open stopped being “only for people who care a lot” and became something you can justify to a normal engineering org without a philosophy lecture.

The scale problem is why model debates go nowhere

Most “best model” arguments are broken because people are arguing about different subsets of releases.

Hugging Face sees around 1,000 to 2,000 model uploads per day. That is 30,000 to 60,000 per month. Interconnects curates roughly 50 models per month in their Artifacts series, around 600 per year, and that still leaves a lot on the cutting room floor.

Figure: Hugging Face model uploads per day, roughly 1,000 to 2,000, as cited by Interconnects.
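
To make the filtering ratio concrete, here is the arithmetic from the paragraph above as a runnable back-of-the-envelope sketch. The only inputs are the approximate figures already cited; nothing else is assumed:

```python
# Back-of-the-envelope check of the curation gap described above.
UPLOADS_PER_DAY = (1_000, 2_000)   # Hugging Face model uploads per day
CURATED_PER_MONTH = 50             # Interconnects' Artifacts series

for per_day in UPLOADS_PER_DAY:
    per_month = per_day * 30
    coverage = CURATED_PER_MONTH / per_month
    print(f"{per_day:,}/day -> {per_month:,}/month, curated ~{coverage:.2%}")

# 1,000/day -> 30,000/month, curated ~0.17%
# 2,000/day -> 60,000/month, curated ~0.08%
```

Even the most generous reading means well under one percent of releases get a serious look, which is why two people can both be “keeping up” and still be arguing about disjoint sets of models.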

So if you think open is “still behind,” the first question is simple: which models did you test, and did you test the ones that match your workload?

What changed in 2025: performance got close, and licensing got real

On benchmarks, multiple sources now report open-weight models within a couple points of top closed systems on some evaluations, with the remaining gap around 1.7% in certain comparisons. That is not the whole story, but it changes what a cost and privacy argument looks like.

Figure: reported benchmark gap between open and closed models shrinking from about 8% to about 1.7%. The figure is approximate and depends on the eval set.

The second change is more important than the chart: licensing norms shifted. Teams are far more willing to build on open weights when the legal story is boring. Permissive licenses turn “cool demo” into “something procurement can sign off on.”

The three releases that set the tone

1. DeepSeek R1

DeepSeek R1 had the biggest blast radius this year.

  • It showed a small, focused team can ship a reasoning model that everyone else has to respond to.
  • It shipped under an MIT license, after DeepSeek V3 used a custom license with restrictions.

The MIT part mattered. A top-tier release under a permissive license changes incentives for everyone else, especially for labs that want adoption outside their home market.

2. Qwen 3

Qwen 3 is not one model. It is a whole line: general models in multiple sizes, dense and MoE variants, plus vision, omni, coding, embedding, and reranker models.

Interconnects calls out that Qwen has overtaken Llama in downloads and fine-tunes, with download trends tracked by projects like The ATOM Project.

My personal view: I do not recommend most Qwen models for my day-to-day work. If you want Qwen, the case I see is mostly multimodality. Otherwise, I reach for other families first.

3. Kimi K2

Kimi K2 is a different kind of success. Moonshot is described as running one main model line at a time, with smaller experiments feeding the next generation. K2 landed as a favorite for raw performance and a distinct writing style.

And yes, if you are going strictly by “benchmark leader open model”, Kimi K2 Thinking gets the crown for 2025 in a lot of people’s rankings.

The strong releases that most teams should know

  • MiniMax M2 made a surprisingly big jump from M1 and stuck around in usage even after free access ended. If you want a short list of leading open models right now, M2 is on it.
  • GLM-4.5 and GLM-4.6 are the ones I want to emphasize more. GLM had been going strong for years but stayed extremely niche until 4.5, and now 4.6. For a while they held the lead as the cost-efficient open choice for tool calling, and I would argue they still do, which is likely why Cursor’s Composer and Windsurf’s SWE 1.5 are both suspected to be based on GLM.
  • GPT-OSS is OpenAI validating open weights. In my first coverage I was a bit harsh and called it benchmaxed and sloptimized, which is still true to a certain extent. But the models are also very good for very cheap, fast tool calling, and that matters a lot if you are building agent workflows (see the tool-calling sketch after this list).
  • Gemma 3 is still one of the best Western options for open vision, and it is strong for multilingual under 30B.
  • Olmo 3 matters for research because Ai2 releases the full stack: data, code, weights, logs, methods. If you care about understanding how these systems are built, that is what “open” should mean.
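
To show what “cheap, fast tool calling” buys you in practice, here is a minimal sketch of a single tool-calling step against an OpenAI-compatible endpoint serving open weights. The base_url, API key, tool definition, and model name are placeholders I made up for illustration; only the general chat-completions-with-tools API shape is real:

```python
# One tool-calling step against an OpenAI-compatible host serving open
# weights. Endpoint, key, and tool are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",  # hypothetical tool
        "description": "Look up a support ticket by id.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-oss-120b",  # assumes your host serves it under this name
    messages=[{"role": "user", "content": "What is the status of ticket 1432?"}],
    tools=tools,
)

# An agent loop runs a call like this on nearly every step, so per-call
# cost and latency compound fast.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```

The point is not the specific model: an agent executes this round trip dozens of times per task, so a model that returns well-formed tool calls quickly and cheaply can beat a smarter, slower one on the workflows that matter.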

Also worth noting: DeepSeek V3.2 sits on that same short list of leading open models alongside MiniMax M2. If you need multimodal open options beyond Gemma, GLM-4.6V is also in the mix.

Usage is still a separate discussion from benchmarks. If you want a grounded view of what people use across providers, this pairs well with OpenRouter’s 100 Trillion Token Study.

Niche winners are where open models win cleanly

The most practical part of the Interconnects recap is the niche list. A lot of teams do not need one model to do everything. They need one model to do one thing reliably and cheaply.

  • Parakeet 3 for speech-to-text. It is fast enough on-device that it can beat cloud tools on end-to-end latency for some workflows (see the latency sketch after this list).
  • Nemotron 2 from NVIDIA, using mamba2‑transformer hybrids for speed, especially at long context, plus a lot of released data.
  • Moondream 3 as the open vision model people keep pointing to if you want strong VLM behavior.
  • Granite 4 from IBM, consistently solid releases, now with mamba‑attention and MoEs, and a writing style that does not read like internet sludge.
  • SmolLM3 for on‑device work. 3B, capable, and Hugging Face shipped unusually good training resources and intermediate checkpoints.
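
The on-device latency claim is easier to see with a toy budget. Every number below is a made-up placeholder, not a Parakeet measurement; the point is only that network legs often dominate:

```python
# Toy end-to-end latency budget for the on-device speech-to-text point.
# All numbers are illustrative placeholders; measure your own pipeline.
on_device_ms = {"local inference": 120}
cloud_ms = {"upload audio": 150, "cloud inference": 40, "download result": 30}

print("on-device:", sum(on_device_ms.values()), "ms")  # 120 ms
print("cloud:    ", sum(cloud_ms.values()), "ms")      # 220 ms
# A slower local model can still win once the network legs are counted.
```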

This is also where agents start to shape model choice. Agentic systems care about tool use, formatting stability, and being predictable under tight prompts. If you are tracking that shift, this connects with what I wrote about platform direction here: Anthropic’s Agent Mode, Claude Code Slack Tagging, and Claude Skills.

The 2025 tier list: who is driving open releases

Interconnects’ tiering is useful because it distinguishes frontier general model labs from specialists.

Frontier: DeepSeek, Qwen, Moonshot AI

Close competitors: Zhipu, MiniMax

Noteworthy: StepFun, InclusionAI / Ant Ling, Meituan Longcat, Tencent, IBM, NVIDIA, Google, Mistral

Specialists: OpenAI, Ai2, Moondream, Arcee, RedNote, Hugging Face, LiquidAI, Microsoft, Xiaomi, Mohamed bin Zayed University of Artificial Intelligence

On the rise: ByteDance Seed, Apertus, OpenBMB, Baidu, Marin Community, InternLM, OpenGVLab, Skywork

Honorable mentions: Meta, Beijing Academy of Artificial Intelligence, Multimodal Art Projection, Huawei

The Meta note is the uncomfortable one: if Llama’s future is uncertain, Western teams lose a default anchor.

How I think teams should choose open models now

If you are evaluating open models in 2026, I would stop asking “what is the best open model” and start asking a few narrower questions (sketched as code right after the list):

  • What is the job? General chat, coding, speech, vision, embeddings, reranking, long context, on‑device.
  • What is the failure cost? If a wrong answer is expensive, you should bias toward robustness and stronger eval coverage.
  • Do you need fine‑tuning? Dense mid‑size models often remain easier to tune than giant MoEs.
  • What is your licensing comfort level? Permissive is simpler; custom licenses create risk later.
  • Where is your latency and cost ceiling? The best model is the one you can run within your budget and response time needs.
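
Here is one way to encode those questions as a first-pass filter. This is a sketch under obvious assumptions: the fields, candidate entries, and thresholds are invented for illustration, and failure cost stays a human judgment call that no filter captures:

```python
# First-pass shortlist over the questions above. Fields and entries are
# illustrative, not a real model catalog.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    jobs: set[str]         # "coding", "vision", "speech", "embeddings", ...
    license: str           # "permissive" or "custom"
    tunable: bool          # realistic to fine-tune on your infra
    p50_latency_ms: int    # measured on your harness, not vendor claims

def shortlist(cands, job, max_latency_ms, need_tuning, permissive_only=True):
    return [
        c for c in cands
        if job in c.jobs
        and c.p50_latency_ms <= max_latency_ms
        and (c.tunable or not need_tuning)
        and (c.license == "permissive" or not permissive_only)
    ]

candidates = [
    Candidate("model-a", {"coding", "tool-use"}, "permissive", True, 400),
    Candidate("model-b", {"vision"}, "custom", False, 900),
]
print([c.name for c in shortlist(candidates, "coding", 800, need_tuning=True)])
# ['model-a']
```

The value is not the code; it is being forced to fill in the fields with numbers you actually measured.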

My take: the best part of open source is competition and fine‑tuning

At this point, my main thought about the benefits of open source is simple.

  • Providers compete. When weights are open, you do not have one cloud holding the keys. Different clouds compete to run the same model faster, cheaper, and better than the others (see the sketch after this list). With closed models you are usually stuck with the direct provider, or maybe a managed option like Vertex or Bedrock.
  • You get access to speed monsters. Open weights are how you get options like Cerebras or Groq hosting the same model at much higher speeds.
  • You can fine-tune for your stack. Cursor’s Composer is the best example of why fine-tuning matters. They likely took something close to GLM-4.6, shaped it to work inside their agent, and now run it on very fast hardware. That is where open models start to look “better” than the raw base model.
  • Running locally is a distant benefit for most people. It is nice if you have the hardware. Most people do not. If you are running something locally on weak hardware, it is usually not that useful for the frontier tasks people care about now.
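
The provider-competition point is concrete enough to script. This sketch hits the same open-weights model through two OpenAI-compatible hosts and compares wall-clock time; the base URLs are placeholders, and the model name is only an example of something multiple hosts serve:

```python
# Same open-weights model, two hosts, wall-clock comparison.
# Base URLs and model id are placeholders for illustration.
import time
from openai import OpenAI

PROVIDERS = {
    "provider-a": "https://provider-a.example/v1",
    "provider-b": "https://provider-b.example/v1",
}
MODEL = "glm-4.6"  # placeholder id; check each host's catalog

for name, base_url in PROVIDERS.items():
    client = OpenAI(base_url=base_url, api_key="...")
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "One-line summary of HTTP/2."}],
    )
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f}s, {len(resp.choices[0].message.content)} chars")
```

With closed weights there is nothing to loop over. With open weights, this comparison is the procurement argument in twenty lines.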

Open vs closed is no longer a moral argument. It is an engineering and procurement decision.

2025 made open models a practical choice. 2026 is about operational discipline: teams building boring reliability around their model stack. If you are watching AI move from casual use to core systems, this ties directly to what I wrote here: Enterprise AI Adoption in 2025: From Casual Chat to Core Infrastructure.