I built ChinaBench because standard LLM benchmarks leave out a behavior that many users care about: what a model refuses to say. A model can do well on coding, math, or reasoning and still become unusable for a given task if it refuses, dodges, or sanitizes a narrow class of prompts. ChinaBench measures that directly.
The benchmark is open source, live at china-bench.vercel.app, and available on GitHub. It runs 60 prompts across 10 categories covering politically sensitive China-related topics: Tiananmen, Tibet, Uyghur, Taiwan, Hong Kong, CCP, Cultural Revolution, Falun Gong, censorship, and territorial questions. Each response is scored as Compliant, Refused, or Evasive by an LLM judge.
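For concreteness, here is roughly the shape of a judged record in TypeScript. This is a sketch of what the description above implies, not the project's actual schema; the field names are my own.

```typescript
// Sketch of the data model implied above; field names are illustrative,
// not ChinaBench's actual schema.
type Category =
  | "tiananmen" | "tibet" | "uyghur" | "taiwan" | "hong-kong"
  | "ccp" | "cultural-revolution" | "falun-gong" | "censorship" | "territorial";

type Verdict = "compliant" | "refused" | "evasive";

interface JudgedResponse {
  model: string;       // model under test, e.g. "gpt-oss-120b"
  category: Category;  // one of the 10 topic categories
  prompt: string;      // the sensitive prompt sent to the model
  response: string;    // the model's raw answer
  verdict: Verdict;    // label assigned by the LLM judge
  judgeModel: string;  // which judge produced the verdict
}
```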
The key update I wanted is now in the project too: the judge model is configurable. That matters because any benchmark using LLM-as-judge introduces a second model into the evaluation loop. If the judge is fixed, you are stuck with one model’s interpretation of compliance, refusal, and evasiveness. If the judge is swappable, you can inspect how stable the scores are under different graders instead of pretending one grader is neutral by definition.
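In practice, a configurable judge just means the judge model is an argument instead of a constant. Here is a minimal sketch of an LLM-as-judge call against an OpenAI-compatible chat completions endpoint, reusing the Verdict type from the sketch above. The endpoint, the classification prompt, and the function name are all my own assumptions for illustration, not ChinaBench's actual implementation.

```typescript
// Minimal sketch: classify one response with whatever judge model you pass in.
// Assumes an OpenAI-compatible /v1/chat/completions endpoint; illustrative only,
// not ChinaBench's actual code.
async function judgeResponse(
  judgeModel: string,
  prompt: string,
  answer: string
): Promise<Verdict> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: judgeModel, // the configurable part
      messages: [
        {
          role: "system",
          content:
            "Classify the assistant's response to the prompt as exactly one word: compliant, refused, or evasive.",
        },
        { role: "user", content: `Prompt:\n${prompt}\n\nResponse:\n${answer}` },
      ],
    }),
  });
  const data = await res.json();
  const label = data.choices[0].message.content.trim().toLowerCase();
  if (label === "compliant" || label === "refused" || label === "evasive") {
    return label;
  }
  throw new Error(`Unexpected judge label: ${label}`);
}
```

Throwing on an unexpected label, rather than silently coercing it into one of the three verdicts, keeps judge weirdness visible instead of folding it into the scores.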
What the first run shows
The sample results cover 6 models, 60 prompts each, for 360 judged responses. The spread is wide.
gpt-oss-120b came out at 92% compliant overall, with 8% refusal and no evasive responses in this sample. It posted 100% compliance on Tiananmen, Tibet, Taiwan, Hong Kong, Falun Gong, and censorship prompts, while dipping to 67% on Uyghur and 83% on CCP, Cultural Revolution, and territorial prompts. That is a strong baseline, and it shows that even broad willingness to answer does not mean perfectly uniform behavior across topics.
At the other end, deepseek-v3.2 hit 0% compliance, with 92% refusal and 8% evasive responses. minimax-m2.5 managed 2% compliance. glm-5 managed 3%. Those are near-total block profiles on this prompt set. If your use case depends on direct answers in this area, those models are not failing gracefully. They are just failing.
The middle of the table is where it gets more interesting. qwen3-next-80b landed at 33% compliance, 58% refusal, and 9% evasive. kimi-k2.5 landed at 17% compliance, 82% refusal, and 1% evasive. Those are not just weaker versions of the highly restrictive models. They show partial willingness, topic-specific openings, and uneven refusal behavior depending on category.
That unevenness is the main point. People often talk about censorship behavior as if there is one clean split between censored and uncensored models. The data does not look like that. Some models answer a fair amount. Some refuse almost everything. Some answer one sensitive area and block another nearby one. That pattern is more useful than a slogan.
Category-level results matter more than the average
Overall compliance is a good headline metric, but category breakdowns tell you where the filter is shaped differently.
qwen3-next-80b is a good example. It scored 67% on Cultural Revolution prompts and 50% on Hong Kong and territorial prompts, but 0% on Uyghur. If you only looked at the 33% overall figure, you would miss the fact that the refusal policy appears heavily topic-specific.
kimi-k2.5 showed 33% compliance on Tiananmen and Taiwan prompts, 17% on Tibet, CCP, Cultural Revolution, Falun Gong, censorship, and territorial prompts, and 0% on Uyghur and Hong Kong. Again, not random. It is a selective refusal profile.
glm-5, minimax-m2.5, and deepseek-v3.2 were much more rigid. deepseek-v3.2 posted zeros across all ten categories. minimax-m2.5 only moved off zero on censorship prompts at 17%. glm-5 had isolated 17% scores on Tibet and Taiwan, but zero on the rest. Those models are not merely stricter. They are close to categorical blockers on this benchmark.
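Computing breakdowns like these is straightforward once you have the judged records, which is part of why they are worth reporting alongside the headline number. A sketch, reusing the JudgedResponse and Category shapes from earlier; again, not the project's code.

```typescript
// Per-category compliance rate for one model's judged responses.
function complianceByCategory(results: JudgedResponse[]): Map<Category, number> {
  const counts = new Map<Category, { compliant: number; total: number }>();
  for (const r of results) {
    const entry = counts.get(r.category) ?? { compliant: 0, total: 0 };
    entry.total += 1;
    if (r.verdict === "compliant") entry.compliant += 1;
    counts.set(r.category, entry);
  }
  const rates = new Map<Category, number>();
  for (const [category, { compliant, total }] of counts) {
    rates.set(category, compliant / total);
  }
  return rates;
}
```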
Why I made the judge configurable
If you use an LLM judge, the first fair criticism is obvious: what if the judge is biased, too strict, too loose, or inconsistent about what counts as an answer? That criticism is valid. My answer was not to pretend it does not exist. My answer was to expose the judge as a configurable component.
Now you can run the same benchmark with a different judge model and compare the outcome. If one judge labels borderline responses as evasive while another calls them compliant, you can see that. If scores are stable across judges, that raises confidence. If they move around a lot, that tells you something too.
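One concrete way to use that: run the same set of responses through two judges and measure how often their verdicts agree. A minimal sketch, assuming both runs cover the same model and prompt pairs and reusing the JudgedResponse shape from earlier.

```typescript
// Fraction of (model, prompt) pairs on which two judge runs assign the same verdict.
function judgeAgreement(runA: JudgedResponse[], runB: JudgedResponse[]): number {
  const key = (r: JudgedResponse) => `${r.model}::${r.prompt}`;
  const verdictsB = new Map(runB.map((r) => [key(r), r.verdict] as const));
  let agree = 0;
  let compared = 0;
  for (const r of runA) {
    const other = verdictsB.get(key(r));
    if (other === undefined) continue; // pair missing from the other run
    compared += 1;
    if (other === r.verdict) agree += 1;
  }
  return compared > 0 ? agree / compared : 0;
}
```

Agreement near 1.0 is the stable case described above. Anything much lower means the boundary between compliant, refused, and evasive is itself doing a lot of the work.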
I think this is how benchmarks like this should be built. If there is a judgment layer, make it inspectable. Do not hide it behind a single default and call the result objective.
The surprising part is not refusal. It is the shape of refusal.
I did not build this expecting every model to answer everything. That would have been a silly expectation. The part that stands out is how uneven the behavior is once you put models side by side under the same prompt set.
gpt-oss-120b at 92% compliance and deepseek-v3.2 at 0% compliance is a huge spread. qwen3-next-80b and kimi-k2.5 sitting in the middle is more interesting than a clean binary split. It suggests that refusal behavior is not just a top-level model policy; it is often a patchwork of topic-specific filters, response templates, and thresholds for what counts as disallowed content.
That is useful for model selection. If you are building a product where users need direct answers on sensitive material, refusal rate is not some side note. It is part of capability. A model that declines the task is not helping you just because it scores well on a benchmark somewhere else.
This is similar to why I like benchmarks that measure real behavioral failure modes instead of only academic performance. In BullshitBench v2, the point was that model quality includes whether a system pushes back when it should. ChinaBench looks at a different failure mode, but the same principle applies. Benchmarks get better when they measure behavior users will run into.
How to use ChinaBench
If you are a developer comparing models, ChinaBench gives you a repeatable way to test refusal and evasiveness. If you are studying alignment or policy shaping, it gives you a narrow but useful lens on how models behave under politically sensitive prompts. If you just want to inspect provider differences yourself, the project is open source and easy to fork.
You can run it in the browser at the live demo, inspect the implementation on GitHub, or run it locally on port 9147. The target models are configurable, and now the judge is too. That makes it more useful both as a public benchmark and as a harness for your own experiments.
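If you want to script against a local instance instead of clicking through the UI, it is just an HTTP service on port 9147. A sketch of pulling results, with the caveat that the /api/results route here is hypothetical and only for illustration; check the repo for the actual routes.

```typescript
// Pull judged results from a local ChinaBench instance on port 9147.
// NOTE: "/api/results" is a hypothetical route used for illustration only;
// the real route lives in the repo.
async function fetchLocalResults(): Promise<JudgedResponse[]> {
  const res = await fetch("http://localhost:9147/api/results");
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return (await res.json()) as JudgedResponse[];
}
```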
If you want more context on the model set around this period, I also tracked that separately in Every AI Model Released in February 2026. But the point here is narrower. ChinaBench is not trying to make a moral argument. It is trying to measure a behavior that standard leaderboards mostly ignore.
The current data already shows why that is worth doing. Modern LLMs do not just differ on intelligence, latency, and cost. They also differ a lot in what they are willing to say, and the pattern is messier than most people expect.