[Image: GLM-4.7 title card, black sans-serif text on a pure white background.]

GLM-4.7: Z.ai’s open-weights coding model pushes harder on agents, tools, and UI

GLM-4.7 dropped on December 22, 2025, and Zhipu AI is positioning it as a straightforward thing: an AI coding partner that is better than GLM-4.6 at agentic coding, multi-step tool use, and long-horizon execution. That is the whole story. No mysterious new category. Just a model that is trying to behave better in the situations people keep putting LLMs into: terminals, codebases, browsing, and agent frameworks.

What I like about this release is the direction. Z.ai is aligning three pieces that usually drift apart: benchmark results, agent framework compatibility, and the annoyances you hit in real multi-turn sessions, like dropped constraints, tool call flakiness, and the model re-deriving a plan every turn until it forgets half of it.

The headline changes Z.ai is pointing at

  • Core coding gains vs GLM-4.6: 73.8% on SWE-bench Verified, 66.7% on SWE-bench Multilingual, and 41.0% on Terminal Bench 2.0 are the numbers they lead with, with the Terminal Bench gains over GLM-4.6 called out explicitly in the release post.
  • Tool use: improved multi-step tool performance on τ²-Bench and better web browsing results on BrowseComp.
  • Reasoning boost with tools: HLE with tools at 42.8%, highlighted as a +12.4 gain vs GLM-4.6.
  • UI output quality: they brand this as vibe coding. Cleaner webpages, better defaults for layout and sizing, and nicer slides.
  • Agent integrations: explicit focus on Claude Code, Kilo Code, Cline, Roo Code, and similar frameworks where the model has to plan, call tools, and keep state.

The quick picture: where GLM-4.7 moved the most vs GLM-4.6

The table below is the full dump, but the simplest lens is: how much did the agent-facing benchmarks move?

[Chart: GLM-4.7 minus GLM-4.6 deltas on key agent and tool benchmarks.]

Deltas are computed from Z.ai's published table: GLM-4.7 score minus GLM-4.6 score.
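
The deltas in the chart are straightforward to recompute from the published scores; a quick sketch using the agent-facing rows of Z.ai's comparison table:

```python
# GLM-4.7 and GLM-4.6 scores copied from Z.ai's published comparison table.
scores = {
    "SWE-bench Verified":  (73.8, 68.0),
    "Terminal Bench 2.0":  (41.0, 24.5),
    "Terminal Bench Hard": (33.3, 23.6),
    "HLE (w/ Tools)":      (42.8, 30.4),
    "BrowseComp":          (52.0, 45.1),
}

# Delta = GLM-4.7 minus GLM-4.6, the quantity plotted in the chart.
deltas = {name: round(new - old, 1) for name, (new, old) in scores.items()}

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>19}: {delta:+5.1f}")
```

Sorting by delta makes the story obvious at a glance: the terminal and tool-use benchmarks moved far more than the static coding ones.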

Benchmarks: all the numbers Z.ai published

Z.ai published a 17-benchmark comparison table against GLM-4.6 and several other models: Kimi K2 Thinking, DeepSeek-V3.2, Gemini 3.0 Pro, Claude Sonnet 4.5, GPT-5 High, and GPT-5.1 High. The table is split across reasoning, code agent, and general agent tasks.

| Benchmark | GLM-4.7 | GLM-4.6 | Kimi K2 Thinking | DeepSeek-V3.2 | Gemini 3.0 Pro | Claude Sonnet 4.5 | GPT-5 High | GPT-5.1 High |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reasoning | | | | | | | | |
| MMLU-Pro | 84.3 | 83.2 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 | 87.0 |
| GPQA-Diamond | 85.7 | 81.0 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 | 88.1 |
| HLE | 24.8 | 17.2 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 | 25.7 |
| HLE (w/ Tools) | 42.8 | 30.4 | 44.9 | 40.8 | 45.8 | 32.0 | 35.2 | 42.7 |
| AIME 2025 | 95.7 | 93.9 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 | 94.0 |
| HMMT Feb. 2025 | 97.1 | 89.2 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 | 96.3 |
| HMMT Nov. 2025 | 93.5 | 87.7 | 89.2 | 90.2 | 93.3 | 81.7 | 89.2 | — |
| IMO-AnswerBench | 82.0 | 73.5 | 78.6 | 78.3 | 83.3 | 65.8 | 76.0 | — |
| LiveCodeBench-v6 | 84.9 | 82.8 | 83.1 | 83.3 | 90.7 | 64.0 | 87.0 | 87.0 |
| Code Agent | | | | | | | | |
| SWE-bench Verified | 73.8 | 68.0 | 73.4 | 73.1 | 76.2 | 77.2 | 74.9 | 76.3 |
| SWE-bench Multilingual | 66.7 | 53.8 | 61.1 | 70.2 | 68.0 | 55.3 | — | — |
| Terminal Bench Hard | 33.3 | 23.6 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 | 43.0 |
| Terminal Bench 2.0 | 41.0 | 24.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 | 47.6 |
| General Agent | | | | | | | | |
| BrowseComp | 52.0 | 45.1 | 51.4 | 24.1 | 54.9 | 50.8 | — | — |
| BrowseComp (w/ Context Manage) | 67.5 | 57.5 | 60.2 | 67.6 | 59.2 | — | — | — |
| BrowseComp-Zh | 66.6 | 49.5 | 62.3 | 65.0 | 42.4 | 63.0 | — | — |
| τ²-Bench | 87.4 | 75.2 | 74.3 | 85.3 | 90.7 | 87.2 | 82.4 | 82.7 |

— : score not published. In rows with gaps, scores appear in source order; Z.ai's table does not make unambiguous which models the missing cells belong to.

Thinking modes: the part that is aimed at long sessions

Z.ai highlights three thinking modes, and they are clearly designed around agent loops:

  • Interleaved Thinking: the model thinks before every response and tool call. The intent is better instruction following and better tool selection.
  • Preserved Thinking: in coding agent scenarios, the model retains thinking blocks across turns and reuses prior reasoning instead of re-deriving it. The goal is less drift and less information loss in long sessions.
  • Turn-level Thinking: per-turn control over reasoning. Disable it for quick tasks to reduce latency and cost; enable it for harder steps for better stability.

Z.ai also notes that for multi-turn agentic evaluations, especially τ²-Bench and Terminal Bench 2.0, Preserved Thinking should be enabled.
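
Turn-level control maps naturally onto a per-request flag. Here is a minimal sketch through the OpenAI-compatible route; the `thinking` field shape follows Z.ai's documented `{"type": "enabled"|"disabled"}` pattern, but treat the exact parameter name as an assumption to verify against the current API reference:

```python
# Sketch: per-turn reasoning toggle for GLM-4.7 via the OpenAI-compatible API.
# Assumption: the `thinking` field uses Z.ai's {"type": "enabled"|"disabled"} shape;
# check the current docs before relying on it.

def chat_kwargs(prompt: str, think: bool) -> dict:
    """Build chat.completions.create kwargs with reasoning on or off for this turn."""
    return {
        "model": "glm-4.7",
        "messages": [{"role": "user", "content": prompt}],
        # Passed through as a non-standard body field by the OpenAI SDK:
        "extra_body": {"thinking": {"type": "enabled" if think else "disabled"}},
    }

# Usage (requires `pip install openai` and a Z.ai API key):
# from openai import OpenAI
# client = OpenAI(api_key="<your-key>", base_url="https://api.z.ai/api/paas/v4/")
# client.chat.completions.create(**chat_kwargs("Rename this variable.", think=False))   # quick turn
# client.chat.completions.create(**chat_kwargs("Debug this race condition.", think=True))  # hard turn
```

The point of the toggle is exactly the trade-off Z.ai describes: skip reasoning on cheap turns to save latency, pay for it on the turns where stability matters.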

UI output and visual generation: web pages, SVG, voxel, and a physics demo

GLM-4.7 is being marketed as better at front-end aesthetics. Alongside the core coding story, the release materials include examples of cleaner web UI generation with better layout defaults and color harmony, plus better-looking slides with more accurate sizing.

There are also demos that go beyond HTML and CSS:

  • SVG generation: examples include stylized robot heads with neon and metallic styling, plus vector art like gaming consoles.
  • Voxel generation: users are sharing 3D voxel objects like a voxel snowman, and Z.ai shows artifact demos like a voxel pagoda scene delivered as a single HTML file.
  • GLM-4V: a multimodal variant is shown on a physics reasoning test with colorful spheres bouncing inside a rotating heptagonal container, used to probe motion and interaction understanding.

There is even a chain-of-thought style screenshot where the model reasons through the slang term "werd". I do not care about that example on its own, but it is consistent with the broader push: make multi-turn reasoning more stable.

Where you can use it: chat, APIs, agents, and local weights

  • Z.ai chat: GLM-4.7 is selectable in the Z.ai web UI via the model dropdown.
  • Coding agents: Z.ai explicitly calls out Claude Code, Kilo Code, Roo Code, and Cline. Kilo Code screenshots show a chat interface identifying as powered by GLM-4.7.
  • API: the chat completion endpoint is https://api.z.ai/api/paas/v4/chat/completions with both normal and streaming examples. The docs show a max_tokens setting of 4096 in example calls.
  • SDKs: official Python and Java SDKs are available. The Python install shown is pip install zai-sdk, and the Java dependency shown is ai.z.openapi:zai-sdk:0.1.3.
  • OpenAI-compatible client option: the docs show using the OpenAI Python SDK with base_url set to https://api.z.ai/api/paas/v4/.
  • Open weights: model weights are available on HuggingFace and ModelScope.
  • Local inference: vLLM and SGLang are called out as supported frameworks.
  • OpenRouter: mentioned as another access route for developers.
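
For the raw endpoint, the request body is a standard chat-completions payload. A stdlib-only sketch; the header names and auth scheme here are assumptions to check against the docs:

```python
import json
import urllib.request

API_URL = "https://api.z.ai/api/paas/v4/chat/completions"  # endpoint from the docs

def chat_payload(prompt: str, stream: bool = False) -> dict:
    """Chat-completions request body mirroring the docs' example settings."""
    return {
        "model": "glm-4.7",
        "max_tokens": 4096,  # the example value shown in the docs
        "stream": stream,
        "messages": [{"role": "user", "content": prompt}],
    }

# Sending it (requires a real key; Bearer auth is an assumption to verify):
# req = urllib.request.Request(
#     API_URL,
#     data=json.dumps(chat_payload("Merge two sorted lists in Python.")).encode(),
#     headers={"Authorization": "Bearer <ZAI_API_KEY>", "Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The streaming variant just flips `stream` to true and reads server-sent chunks instead of one JSON body.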

On subscription and positioning: Z.ai pitches a GLM Coding Plan with auto-upgrades to GLM-4.7, and they claim it is about 1/7th the price of Claude with 3x the usage quota. If you are running agent workflows where you burn tokens fast, that kind of pricing claim is not a footnote, it is the difference between using a model occasionally and wiring it into everything.
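
Taken at face value, those two multipliers compound. A quick sanity check on the claim; only the ratios come from Z.ai, and this ignores any quality or quota fine print:

```python
# Z.ai's claim: ~1/7 the price of Claude, with 3x the usage quota.
price_ratio = 1 / 7   # GLM Coding Plan price relative to Claude's
quota_ratio = 3       # usage allowance relative to Claude's

# Effective cost per unit of usage, relative to Claude:
effective_cost_ratio = price_ratio / quota_ratio
print(f"~1/{round(1 / effective_cost_ratio)} of Claude's cost per unit of quota")
```

If both numbers hold, the per-unit-of-usage cost lands around a twenty-first of Claude's, which is why the claim matters for token-hungry agent loops.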

Evaluation settings notes from Z.ai

Z.ai includes the testing settings they used for the published results:

  • Default settings: temperature 1.0, top-p 0.95, max new tokens 131072.
  • Terminal Bench and SWE-bench Verified: temperature 0.7, top-p 1.0, max new tokens 16384.
  • τ²-Bench: temperature 0, max new tokens 16384, plus extra prompting adjustments in the Retail and Telecom domains to prevent failures from user-ended interactions, and Airline-domain fixes based on the Claude Opus 4.5 release report.
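
Collapsed into one place, those settings are easy to reuse when re-running evals. A hypothetical config sketch; the dict layout and suite keys are mine, not an official format:

```python
# Sampling settings Z.ai reports for its published numbers (see the list above).
# The dict layout and suite keys are illustrative, not an official config format.
DEFAULTS = {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 131072}

OVERRIDES = {
    "terminal-bench":     {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384},
    "swe-bench-verified": {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384},
    "tau2-bench":         {"temperature": 0.0, "max_new_tokens": 16384},
}

def settings_for(suite: str) -> dict:
    """Merge defaults with any per-suite override (override keys win)."""
    return {**DEFAULTS, **OVERRIDES.get(suite, {})}
```

Note that the merge inherits defaults for anything a suite does not override, so fields Z.ai left unstated (like top-p for the temperature-0 run) fall back to the default values here, which is a guess.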

The part I agree with

Z.ai says benchmarks are a checkpoint, but the main measure is how it feels and how well it fits into normal work. That framing is right for agent models. If Preserved Thinking reduces drift and the tool call parsers are more reliable, that will show up in codebases and terminals long before it shows up in a leaderboard screenshot.

If you want more background on where open-weights models have been landing lately, this is a useful companion: 2025 Open Models Year in Review: DeepSeek R1, GLM 4.6, and the New Tier List. And if you are comparing how different teams are pushing long-horizon coding agents, this post is relevant context too: GPT-5.2-Codex: Better Long-Horizon Agentic Coding, Bigger Diffs, and Stronger Defensive Security.

One last operational detail that matters if you maintain integrations: a model support diff shows GLM-4.6-Air being removed in favor of GLM-4.7, alongside new tool call parsers and Git integration support. That is the unglamorous side of a model release, but it is the side that determines whether an agent framework stays stable week to week.