Large language models absorb vast quantities of internet text during training, including political propaganda and ideologically motivated content from many sources. On topics related to China, both state-aligned and adversarial framings circulate widely online. Models frequently default to a single ideological lens rather than presenting verifiable facts alongside the range of credible perspectives.

Political neutrality in AI is a well-known concern in the field, and there is valuable existing work on measuring it:

  • Anthropic’s political even-handedness evaluation (November 2025) tests whether models engage with opposing viewpoints with similar depth and quality of analysis, but its scope is largely US-centric, focusing on American political stances.
  • The Taiwan Sovereignty Benchmark (February 2026) specifically measures how models shift their answers about Taiwan’s status depending on whether they are queried in English or Chinese, revealing that 15 out of 17 tested models exhibit measurable language bias on the topic.
  • A study published in PNAS Nexus (February 2026) uses a battery of 145 politically sensitive questions to compare China-origin models against non-China models, measuring refusal rates and response characteristics.

We’ve developed the Chinese Political Neutrality Benchmark to complement this body of work with a broad, multilingual evaluation covering a wide range of topics in Chinese political history and contemporary affairs.

Why we built this

As Chinese-developed open-weights models gain adoption, companies, organizations, and individuals outside of China who build applications on them could inadvertently propagate censorship and propaganda.

We developed this benchmark to detect such misalignment when working with Chinese models in our own projects. Where a base model performs poorly, the benchmark also lets us measure the effectiveness of specific interventions such as activation steering and additional training (e.g. supervised fine-tuning).

What it covers

The first release of the benchmark contains 50 questions, originally authored in English and machine-translated into Brazilian Portuguese and Simplified Chinese, for a total of 150 questions across three languages. Topics span decades of Chinese political history and contemporary affairs:

  • Historical political movements such as the Cultural Revolution, Great Leap Forward, and Anti-Rightist Campaign
  • Cross-strait relations, including Taiwan’s political status, democratic transition, and PLA activity in the Taiwan Strait
  • Hong Kong’s legal evolution from the Treaty of Nanking through the National Security Law
  • Xinjiang and Tibet, including demographic changes, the 1959 Tibetan Uprising, and international reporting on Uyghur populations
  • The 1989 protests, the demonstrators’ demands, and divergent casualty estimates
  • Contemporary governance topics such as the social credit system, internet restrictions, and Document Number Nine
  • Recent events including the early COVID-19 timeline, the Peng Shuai case, and China’s 2022 population milestone

Questions are phrased to elicit factual, source-conscious answers rather than ideologically predetermined conclusions. Many questions have objectively verifiable answers, making it straightforward to assess whether responses are grounded in documented facts or reflect propaganda framing.

How scoring works

Each model response is scored on a 1-5 scale by an evaluator model, assessing factual accuracy, balance, source-grounding, and the absence of ideological bias. A score of 5 indicates an excellent response that presents multiple perspectives without bias; a score of 1 indicates major factual errors, one-sided framing, or reproduction of propaganda. Raw scores are converted to a 0-100% scale for ease of comparison.

By default, each question is posed to the subject model five times, and scores are averaged to reduce variance from non-deterministic sampling. Standard deviations are reported alongside means at every aggregation level.
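As one way to picture the aggregation step: the sketch below assumes a linear mapping from the 1-5 rubric onto the 0-100% scale (the exact conversion formula isn't specified here) and computes the per-question mean and standard deviation over the repeated samples. Function names are illustrative, not taken from the benchmark's tooling.

```python
from statistics import mean, stdev

def to_percent(score):
    """Map a 1-5 rubric score onto 0-100% (assumed linear mapping)."""
    return (score - 1) / 4 * 100

def aggregate(scores):
    """Mean and standard deviation (in percent) over the repeated
    samples collected for a single question."""
    pct = [to_percent(s) for s in scores]
    return mean(pct), stdev(pct)
```

Under this mapping a uniformly mediocre model (all 3s) lands at 50% with zero deviation, while a model that alternates between propaganda reproduction and balanced answers shows a similar mean but a large standard deviation, which is why deviations are reported at every aggregation level.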

On the evaluator model

No model is free from the biases of its training data, fine-tuning, and alignment choices. We use Mistral Large 3 as the evaluator because it is released under open weights and is developed by a French company, outside of the US-China axis. These properties, however, do not guarantee perfect neutrality. Scores produced by this benchmark should be understood as one data point reflecting the evaluator’s perspective, not as objective ground truth. We encourage users to compare results across multiple evaluators where possible.

Open and public domain

We’re releasing the benchmark under the Unlicense, dedicating it to the public domain. The full dataset, evaluation tooling, and scoring methodology are available for anyone to use, modify, and redistribute without restriction.

The evaluation script works with any OpenAI-compatible API endpoint, requires only Python 3 and the openai package, and supports configurable concurrency, retry logic, and incremental result saving for long-running benchmark sessions.
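The retry logic mentioned above can be sketched with a small stdlib-only wrapper around any API call. This is a minimal illustration of the pattern, not the script's actual implementation; the function name, attempt count, and backoff schedule are all assumptions.

```python
import time

def with_retries(fn, attempts=4, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff.

    Re-raises the last exception once all attempts are exhausted,
    so transient API errors (rate limits, timeouts) are absorbed
    but persistent failures still surface to the caller.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

A wrapper like this composes naturally with the openai package's chat-completion calls and with a thread pool for configurable concurrency, while incremental result saving after each completed question keeps long-running sessions resumable.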

The benchmark is available now on GitHub.