CMMLU: Measuring massive multitask language understanding in Chinese
Pith reviewed 2026-05-16 20:58 UTC · model grok-4.3
The pith
Most large language models score below 50 percent on a new Chinese multitask understanding benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that CMMLU exposes significant deficiencies in LLMs' Chinese knowledge and reasoning, as most models cannot achieve 50 percent accuracy on this broad set of subjects even with advanced prompting techniques.
What carries the argument
CMMLU, a massive multitask benchmark of Chinese-language multiple-choice questions designed to assess knowledge and reasoning across many subjects.
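Because CMMLU items are four-option multiple-choice questions, the 25% random baseline quoted in the abstract is simply 1/4. A minimal scoring sketch in Python; the item schema and field names are illustrative assumptions, not the paper's release format:

    import random

    # Hypothetical CMMLU-style item: a four-option multiple-choice question.
    # The field names are illustrative, not the paper's actual data schema.
    item = {
        "question": "Which dynasty directly preceded the Tang?",
        "options": {"A": "Han", "B": "Sui", "C": "Song", "D": "Ming"},
        "answer": "B",
    }

    def accuracy(predictions, answers):
        """Fraction of items where the predicted letter matches the answer key."""
        correct = sum(p == a for p, a in zip(predictions, answers))
        return correct / len(answers)

    # A random guesser picks one of four letters uniformly, so its expected
    # accuracy is 1/4 = 25%, the baseline the paper reports.
    answers = ["B"] * 10_000
    guesses = [random.choice("ABCD") for _ in answers]
    print(f"random baseline ~ {accuracy(guesses, answers):.3f}")  # ~0.250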
If this is right
- There is significant room for improvement in LLMs regarding Chinese language processing.
- Standard prompting methods do not fully address the performance shortfalls.
- Experiments on the factors that impact performance can inform future model enhancements.
Where Pith is reading between the lines
- Creating similar benchmarks for other languages would help determine whether this is a general multilingual issue.
- Models specifically trained or fine-tuned on Chinese data may show better results on this benchmark.
- This evaluation framework could be used to measure progress in future iterations of Chinese LLMs.
Load-bearing premise
The questions in CMMLU accurately represent the knowledge and reasoning demands of real Chinese-language tasks across the covered subjects.
What would settle it
If an LLM achieves high accuracy on CMMLU but performs poorly on authentic Chinese language applications outside the benchmark, this would falsify the assumption that the benchmark measures relevant capabilities.
read the original abstract
As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CMMLU, a new Chinese-language benchmark covering natural sciences, social sciences, engineering, and humanities. It reports direct accuracy measurements on 18 multilingual and Chinese-oriented LLMs across standard, few-shot, and chain-of-thought prompting settings, finding that most models remain below 50% average accuracy (versus a 25% random baseline) and identifying factors that affect performance.
Significance. If the questions are valid, the work supplies a needed Chinese counterpart to MMLU with reproducible multi-model, multi-setting measurements that credibly document a substantial performance gap. The empirical focus and exploration of prompting factors are strengths that would support acceptance once validation details are added.
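The measurements behind this claim compare zero-shot, few-shot (in-context), and chain-of-thought prompting. A generic sketch of how such prompts are typically assembled; the templates and item schema below are illustrative assumptions, not the paper's actual (Chinese-language) prompts:

    # Generic prompt builders for the three evaluation settings the report
    # mentions. The English templates are illustrative only.

    def format_item(q, options):
        lines = [q] + [f"{k}. {v}" for k, v in options.items()]
        return "\n".join(lines)

    def zero_shot_prompt(item):
        return format_item(item["question"], item["options"]) + "\nAnswer:"

    def few_shot_prompt(item, exemplars):
        """Prepend k solved exemplars (in-context examples) to the test item."""
        shots = [
            format_item(ex["question"], ex["options"]) + f"\nAnswer: {ex['answer']}"
            for ex in exemplars
        ]
        return "\n\n".join(shots + [zero_shot_prompt(item)])

    def chain_of_thought_prompt(item):
        """Ask the model to reason step by step before committing to a letter."""
        return (
            format_item(item["question"], item["options"])
            + "\nLet's think step by step, then answer with a single letter."
        )

    demo = {
        "question": "2 + 2 = ?",
        "options": {"A": "3", "B": "4", "C": "5", "D": "22"},
        "answer": "B",
    }
    print(zero_shot_prompt(demo))
    print(chain_of_thought_prompt(demo))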
major comments (1)
- [§3] §3 (Dataset Construction): The manuscript describes internal question curation and filtering but supplies no inter-annotator agreement statistics, external expert review results, or contamination checks against common Chinese web corpora. Because the central claim (LLMs <50% vs. 25% random) requires that items are unambiguous and representative, this omission is load-bearing and must be addressed with concrete validation numbers before the performance gap can be confidently attributed to model limitations rather than benchmark noise.
minor comments (2)
- [Table 2] Table 2 and Figure 3: Per-subject accuracy tables would benefit from reporting the exact number of questions per category and any run-to-run standard deviations, so that readers can assess the statistical reliability of the reported gaps (see the sketch after this list).
- [§5] §5 (Analysis): The discussion of factors impacting performance is useful but would be clearer if the authors explicitly state which prompting variants were used for each model family in the main results table.
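To make the reliability reporting requested in the first minor comment concrete, a small sketch of per-subject means and run-to-run standard deviations; the subjects, question counts, and accuracies below are hypothetical placeholders, not values from the paper:

    from statistics import mean, stdev

    # Hypothetical per-run accuracies keyed by subject; in an actual revision
    # these would come from repeated evaluation runs of the same model.
    runs = {
        "ancient_chinese": [0.31, 0.29, 0.33],
        "college_medicine": [0.42, 0.44, 0.41],
    }
    n_questions = {"ancient_chinese": 164, "college_medicine": 273}  # illustrative counts

    for subject, accs in runs.items():
        print(
            f"{subject}: n={n_questions[subject]}, "
            f"mean={mean(accs):.3f}, sd={stdev(accs):.3f}"
        )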
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on CMMLU. We agree that explicit validation statistics are necessary to support the central performance claims and will strengthen the dataset construction section accordingly.
read point-by-point responses
- Referee: [§3] §3 (Dataset Construction): The manuscript describes internal question curation and filtering but supplies no inter-annotator agreement statistics, external expert review results, or contamination checks against common Chinese web corpora. Because the central claim (LLMs <50% vs. 25% random) requires that items are unambiguous and representative, this omission is load-bearing and must be addressed with concrete validation numbers before the performance gap can be confidently attributed to model limitations rather than benchmark noise.
Authors: We agree that these details are important for establishing benchmark quality. The original curation involved multiple native Chinese speakers with subject expertise who independently reviewed and filtered questions for clarity and correctness, followed by a final consistency pass. In the revised manuscript we will add: (1) inter-annotator agreement statistics (Cohen’s kappa and percentage agreement) computed on a random sample of 500 questions; (2) a summary of external expert review by three university professors in relevant disciplines who verified a stratified subset of 300 items; and (3) contamination analysis results obtained by searching the questions against publicly available Chinese web corpora (Baidu Baike, Zhihu, and Common Crawl Chinese subsets) using both exact-match and fuzzy similarity thresholds, reporting the fraction of items that appear in training data sources. These additions will be placed in a new subsection of §3 and will include the concrete numerical results.
revision: yes
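The rebuttal commits to two concrete computations: Cohen's kappa for inter-annotator agreement and an exact-plus-fuzzy contamination screen. A minimal standard-library sketch of both; the 0.9 similarity threshold and the corpus interface are assumptions, not the authors' pipeline:

    from difflib import SequenceMatcher

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators over the same items:
        kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
        and p_e is the agreement expected by chance from marginal rates."""
        n = len(labels_a)
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        cats = set(labels_a) | set(labels_b)
        p_e = sum(
            (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats
        )
        return (p_o - p_e) / (1 - p_e)

    def looks_contaminated(question, corpus_docs, threshold=0.9):
        """Flag a question that appears (near-)verbatim in a corpus document.
        Exact substring match first, then a fuzzy-similarity fallback; the
        0.9 threshold is an assumed value, not one from the paper."""
        for doc in corpus_docs:
            if question in doc:
                return True
            if SequenceMatcher(None, question, doc).ratio() >= threshold:
                return True
        return False

    print(cohens_kappa(list("ABAB"), list("ABBB")))  # 0.5
    corpus = ["The Sui dynasty directly preceded the Tang dynasty.", "unrelated"]
    print(looks_contaminated("Sui dynasty directly preceded the Tang", corpus))  # True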
Circularity Check
No circularity: direct empirical benchmark measurement
full rationale
The paper constructs CMMLU as a new Chinese-language multiple-choice benchmark and reports raw accuracies of 18 LLMs under standard prompting regimes against an external random baseline of 25%. No equations, fitted parameters, or derived predictions appear; the central claim is a straightforward empirical measurement. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the results. The work is therefore self-contained and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Accuracy is an appropriate metric for multiple-choice knowledge and reasoning tasks
- [domain assumption] The selected subjects and questions are representative of Chinese-language understanding
Forward citations
Cited by 18 Pith papers
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
  A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
- Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
  ProHist-Bench shows that even state-of-the-art LLMs struggle with complex historical research questions requiring evidentiary reasoning, based on 400 questions and 10,891 rubrics from the Keju system.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
  Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
- MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
  MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
  Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
- LLaDA2.0: Scaling Up Diffusion Language Models to 100B
  LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- MiMo-V2-Flash Technical Report
  MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
- Kimi K2: Open Agentic Intelligence
  Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
- Yi: Open Foundation Models by 01.AI
  Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
discussion (0)