CMMLU: Measuring massive multitask language understanding in Chinese
Pith reviewed 2026-05-16 20:58 UTC · model grok-4.3
The pith
Most large language models score below 50 percent on a new Chinese multitask understanding benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that CMMLU exposes significant deficiencies in LLMs' Chinese knowledge and reasoning, as most models cannot achieve 50 percent accuracy on this broad set of subjects even with advanced prompting techniques.
What carries the argument
CMMLU, a massive multitask benchmark of Chinese-language multiple-choice questions designed to assess knowledge and reasoning across many subjects.
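Because CMMLU items are four-option multiple-choice questions, the 25% random baseline quoted in the abstract is simply 1/4. A minimal scoring sketch in Python; the item schema and field names are illustrative assumptions, not the paper's release format:

    import random

    # Hypothetical CMMLU-style item: a four-option multiple-choice question.
    # The field names are illustrative, not the paper's actual data schema.
    item = {
        "question": "Which dynasty directly preceded the Tang?",
        "options": {"A": "Han", "B": "Sui", "C": "Song", "D": "Ming"},
        "answer": "B",
    }

    def accuracy(predictions, answers):
        """Fraction of items where the predicted letter matches the answer key."""
        correct = sum(p == a for p, a in zip(predictions, answers))
        return correct / len(answers)

    # A random guesser picks one of four letters uniformly, so its expected
    # accuracy is 1/4 = 25%, the baseline the paper reports.
    answers = ["B"] * 10_000
    guesses = [random.choice("ABCD") for _ in answers]
    print(f"random baseline ~ {accuracy(guesses, answers):.3f}")  # ~0.250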
If this is right
- There is significant room for improvement in LLMs regarding Chinese language processing.
- Standard prompting methods do not fully address the performance shortfalls.
- Experiments on the factors that impact performance can inform future model enhancements.
Where Pith is reading between the lines
- Creating similar benchmarks for other languages would help determine whether this is a general multilingual issue.
- Models specifically trained or fine-tuned on Chinese data may show better results on this benchmark.
- This evaluation framework could be used to measure progress in future iterations of Chinese LLMs.
Load-bearing premise
The questions in CMMLU accurately represent the knowledge and reasoning demands of real Chinese-language tasks across the covered subjects.
What would settle it
If an LLM achieves high accuracy on CMMLU but performs poorly on authentic Chinese language applications outside the benchmark, this would falsify the assumption that the benchmark measures relevant capabilities.
read the original abstract
As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CMMLU, a new Chinese-language benchmark covering natural sciences, social sciences, engineering, and humanities. It reports direct accuracy measurements on 18 multilingual and Chinese-oriented LLMs across standard, few-shot, and chain-of-thought prompting settings, finding that most models remain below 50% average accuracy (versus a 25% random baseline) and identifying factors that affect performance.
Significance. If the questions are valid, the work supplies a needed Chinese counterpart to MMLU with reproducible multi-model, multi-setting measurements that credibly document a substantial performance gap. The empirical focus and exploration of prompting factors are strengths that would support acceptance once validation details are added.
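The measurements behind this claim compare zero-shot, few-shot (in-context), and chain-of-thought prompting. A generic sketch of how such prompts are typically assembled; the templates and item schema below are illustrative assumptions, not the paper's actual (Chinese-language) prompts:

    # Generic prompt builders for the three evaluation settings the report
    # mentions. The English templates are illustrative only.

    def format_item(q, options):
        lines = [q] + [f"{k}. {v}" for k, v in options.items()]
        return "\n".join(lines)

    def zero_shot_prompt(item):
        return format_item(item["question"], item["options"]) + "\nAnswer:"

    def few_shot_prompt(item, exemplars):
        """Prepend k solved exemplars (in-context examples) to the test item."""
        shots = [
            format_item(ex["question"], ex["options"]) + f"\nAnswer: {ex['answer']}"
            for ex in exemplars
        ]
        return "\n\n".join(shots + [zero_shot_prompt(item)])

    def chain_of_thought_prompt(item):
        """Ask the model to reason step by step before committing to a letter."""
        return (
            format_item(item["question"], item["options"])
            + "\nLet's think step by step, then answer with a single letter."
        )

    demo = {
        "question": "2 + 2 = ?",
        "options": {"A": "3", "B": "4", "C": "5", "D": "22"},
        "answer": "B",
    }
    print(zero_shot_prompt(demo))
    print(chain_of_thought_prompt(demo))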
major comments (1)
- [§3] §3 (Dataset Construction): The manuscript describes internal question curation and filtering but supplies no inter-annotator agreement statistics, external expert review results, or contamination checks against common Chinese web corpora. Because the central claim (LLMs <50% vs. 25% random) requires that items are unambiguous and representative, this omission is load-bearing and must be addressed with concrete validation numbers before the performance gap can be confidently attributed to model limitations rather than benchmark noise.
minor comments (2)
- [Table 2] Table 2 and Figure 3: Per-subject accuracy tables would benefit from reporting the exact number of questions per category and any run-to-run standard deviations, so that readers can assess the statistical reliability of the reported gaps (see the sketch after this list).
- [§5] §5 (Analysis): The discussion of factors impacting performance is useful but would be clearer if the authors explicitly state which prompting variants were used for each model family in the main results table.
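To make the reliability reporting requested in the first minor comment concrete, a small sketch of per-subject means and run-to-run standard deviations; the subjects, question counts, and accuracies below are hypothetical placeholders, not values from the paper:

    from statistics import mean, stdev

    # Hypothetical per-run accuracies keyed by subject; in an actual revision
    # these would come from repeated evaluation runs of the same model.
    runs = {
        "ancient_chinese": [0.31, 0.29, 0.33],
        "college_medicine": [0.42, 0.44, 0.41],
    }
    n_questions = {"ancient_chinese": 164, "college_medicine": 273}  # illustrative counts

    for subject, accs in runs.items():
        print(
            f"{subject}: n={n_questions[subject]}, "
            f"mean={mean(accs):.3f}, sd={stdev(accs):.3f}"
        )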
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on CMMLU. We agree that explicit validation statistics are necessary to support the central performance claims and will strengthen the dataset construction section accordingly.
read point-by-point responses
- Referee: [§3] §3 (Dataset Construction): The manuscript describes internal question curation and filtering but supplies no inter-annotator agreement statistics, external expert review results, or contamination checks against common Chinese web corpora. Because the central claim (LLMs <50% vs. 25% random) requires that items are unambiguous and representative, this omission is load-bearing and must be addressed with concrete validation numbers before the performance gap can be confidently attributed to model limitations rather than benchmark noise.
Authors: We agree that these details are important for establishing benchmark quality. The original curation involved multiple native Chinese speakers with subject expertise who independently reviewed and filtered questions for clarity and correctness, followed by a final consistency pass. In the revised manuscript we will add: (1) inter-annotator agreement statistics (Cohen’s kappa and percentage agreement) computed on a random sample of 500 questions; (2) a summary of external expert review by three university professors in relevant disciplines who verified a stratified subset of 300 items; and (3) contamination analysis results obtained by searching the questions against publicly available Chinese web corpora (Baidu Baike, Zhihu, and Common Crawl Chinese subsets) using both exact-match and fuzzy similarity thresholds, reporting the fraction of items that appear in training data sources. These additions will be placed in a new subsection of §3 and will include the concrete numerical results.
revision: yes
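The rebuttal commits to two concrete computations: Cohen's kappa for inter-annotator agreement and an exact-plus-fuzzy contamination screen. A minimal standard-library sketch of both; the 0.9 similarity threshold and the corpus interface are assumptions, not the authors' pipeline:

    from difflib import SequenceMatcher

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators over the same items:
        kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
        and p_e is the agreement expected by chance from marginal rates."""
        n = len(labels_a)
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        cats = set(labels_a) | set(labels_b)
        p_e = sum(
            (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats
        )
        return (p_o - p_e) / (1 - p_e)

    def looks_contaminated(question, corpus_docs, threshold=0.9):
        """Flag a question that appears (near-)verbatim in a corpus document.
        Exact substring match first, then a fuzzy-similarity fallback; the
        0.9 threshold is an assumed value, not one from the paper."""
        for doc in corpus_docs:
            if question in doc:
                return True
            if SequenceMatcher(None, question, doc).ratio() >= threshold:
                return True
        return False

    print(cohens_kappa(list("ABAB"), list("ABBB")))  # 0.5
    corpus = ["The Sui dynasty directly preceded the Tang dynasty.", "unrelated"]
    print(looks_contaminated("Sui dynasty directly preceded the Tang", corpus))  # True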
Circularity Check
No circularity: direct empirical benchmark measurement
full rationale
The paper constructs CMMLU as a new Chinese-language multiple-choice benchmark and reports raw accuracies of 18 LLMs under standard prompting regimes against an external random baseline of 25%. No equations, fitted parameters, or derived predictions appear; the central claim is a straightforward empirical measurement. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the results. The work is therefore self-contained and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Accuracy is an appropriate metric for multiple-choice knowledge and reasoning tasks
- [domain assumption] The selected subjects and questions are representative of Chinese-language understanding
Forward citations
Cited by 18 Pith papers
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
  A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
- Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
  ProHist-Bench shows that even state-of-the-art LLMs struggle with complex historical research questions requiring evidentiary reasoning, based on 400 questions and 10,891 rubrics from the Keju system.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
  Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
- MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
  MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
  Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
- LLaDA2.0: Scaling Up Diffusion Language Models to 100B
  LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- MiMo-V2-Flash Technical Report
  MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
- Kimi K2: Open Agentic Intelligence
  Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
- Yi: Open Foundation Models by 01.AI
  Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
discussion (0)