Evaluating the Performance of Large Language Models on GAOKAO Benchmark
Pith reviewed 2026-05-17 12:23 UTC · model grok-4.3
pith:RLI6ARNP
The pith
Large language models achieve competitive scores on the Chinese GAOKAO exam but vary widely by subject.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using zero-shot prompting on GAOKAO questions, LLMs such as GPT-4 obtain converted total scores that are competitive with human performance on the Chinese college entrance examination, yet they display significant performance disparities across different subjects. Additionally, when LLMs are used to grade subjective questions, their assigned scores show a moderate level of consistency with those given by human evaluators.
What carries the argument
GAOKAO-Bench, a benchmark that poses real Chinese GAOKAO exam questions to large language models under zero-shot conditions and has human graders score the responses to produce comparable total marks.
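To make the zero-shot setup concrete, here is a minimal sketch of how a single exam question might be turned into a prompt with no worked examples attached. The `ExamQuestion` fields, the `build_zero_shot_prompt` helper, and the prompt wording are all hypothetical illustrations; the paper's actual prompts are not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass
class ExamQuestion:
    subject: str        # e.g. "Geography"
    question_type: str  # "objective" (multiple choice) or "subjective" (open-ended)
    text: str           # the question exactly as printed on the exam
    max_points: int     # full marks available for this question


def build_zero_shot_prompt(q: ExamQuestion) -> str:
    """Return a prompt containing only instructions and the question.

    "Zero-shot" here means no solved example questions are included.
    The wording below is an assumption, not the paper's template.
    """
    if q.question_type == "objective":
        instruction = "Choose the correct option and reply with its letter."
    else:
        instruction = "Write a complete answer as you would on the exam paper."
    return (
        f"You are taking the {q.subject} paper of the GAOKAO.\n"
        f"{instruction}\n"
        f"This question is worth {q.max_points} points.\n\n"
        f"Question:\n{q.text}\n"
    )
```

Keeping the instruction separate from the question text makes it easy to check that nothing beyond the question itself, such as a solved example, leaks into the prompt.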
If this is right
- LLMs demonstrate the ability to address both multiple-choice and open-ended questions typical of high-stakes standardized tests.
- Subject-specific gaps indicate that models are stronger in areas like language and literature but weaker in mathematics or sciences.
- LLM-based grading of subjective responses achieves enough consistency to serve as a supplementary evaluation tool.
- Future models can be tested against this benchmark to track progress toward human-level exam performance.
Where Pith is reading between the lines
- This benchmark could be extended to other national exams to compare LLM capabilities across cultures.
- Disparities across subjects point to the need for targeted training data in weaker areas like quantitative reasoning.
- Moderate consistency in grading suggests LLMs might assist teachers but require human oversight for final decisions.
Load-bearing premise
That zero-shot answers from LLMs can be graded using the same criteria and standards applied to human student exam papers.
What would settle it
A comparison in which human graders score a set of real GAOKAO student answers and a matched set of LLM-generated answers to the same questions. The claim would be undermined if the LLM set received substantially lower average scores or failed to reach competitive totals.
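One way to decide whether a gap between the two sets is "substantial" is a permutation test on the difference in mean score. The sketch below assumes per-answer scores are available as plain lists; the function name, the one-sided direction, and the resample count are illustrative choices, not something the paper or this review prescribes.

```python
import random
from statistics import mean


def permutation_test(student_scores, llm_scores, n_resamples=10_000, seed=0):
    """One-sided permutation test on the difference in mean score.

    Returns (observed_diff, p_value), where observed_diff is
    mean(student) - mean(llm) and p_value estimates how often a random
    relabeling of the pooled scores gives a difference at least as large.
    """
    rng = random.Random(seed)
    observed = mean(student_scores) - mean(llm_scores)
    pooled = list(student_scores) + list(llm_scores)
    n_student = len(student_scores)
    at_least_as_large = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_student]) - mean(pooled[n_student:])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (at_least_as_large + 1) / (n_resamples + 1)
    return observed, p_value
```

A small p-value would indicate that students outscore the models by more than chance relabeling explains. A large one would not by itself show the models are competitive, since the test may simply lack power on a small sample.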
Original abstract
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks; however, how to comprehensively and accurately assess their performance becomes an urgent issue to be addressed. This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples, including both subjective and objective questions. To align with human examination methods, we design a method based on zero-shot settings to evaluate the performance of LLMs. With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot. Our findings reveal that LLMs have achieved competitive scores in Chinese GAOKAO examination, while they exhibit significant performance disparities across various subjects. We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores. In conclusion, this research contributes a robust evaluation benchmark for future large language models and offers valuable insights into the advantages and limitations of such models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GAOKAO-Bench, a benchmark built from Chinese Gaokao examination questions (both subjective and objective), and evaluates LLMs including GPT-4, ChatGPT and ERNIE-Bot under zero-shot prompting. Human graders convert model outputs to total scores; the central claims are that LLMs reach competitive overall performance while showing large subject-wise disparities, and that LLM-generated grades for subjective items exhibit moderate agreement with human grades.
Significance. If the human-evaluation protocol proves reliable and the zero-shot responses can be fairly compared to student answers, the benchmark supplies a high-stakes, multi-subject, non-English test bed that complements existing English-centric evaluations and could expose gaps in factual recall, reasoning, and Chinese-language proficiency that standard NLP tasks miss.
major comments (2)
- [Abstract / Evaluation] The headline claim that LLMs 'have achieved competitive scores' rests on converted total scores obtained via human evaluation, yet the manuscript supplies no sample sizes per subject, no precise scoring rubrics applied by graders, no inter-rater reliability statistics, and no statistical tests comparing model scores to human baselines. Without these, the competitiveness assertion cannot be verified.
- [Evaluation / Human grading protocol] The central comparison of zero-shot LLM outputs to human Gaokao performance assumes that graders apply the same standards to concise, unpracticed model responses as to answers written by students who have studied the full curriculum. The paper does not report whether the rubric explicitly instructs evaluators to discount response fluency, length, or the absence of exam-specific strategies, which introduces a systematic risk that scores reflect surface features rather than subject mastery.
minor comments (2)
- [Abstract] The abstract asserts 'significant performance disparities across various subjects' but does not preview which subjects or supply even summary quantitative differences; a table or figure reference would improve clarity.
- [Methods] Notation for 'converted total score' is introduced without an explicit formula or conversion table showing how raw human grades map to the final scale used for comparison with official Gaokao cut-offs.
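To illustrate what an explicit conversion could look like, the sketch below scales each subject's scoring rate (points earned divided by points available on the sampled questions) to that subject's full marks and sums the results. This formula is an assumption made for illustration, since the review notes that the manuscript gives none. The subject list, the full-mark values, and all scores are likewise hypothetical.

```python
def converted_total(earned, available, full_marks):
    """Scale per-subject scoring rates to an exam-sized total.

    earned[s]     points the graders awarded on the sampled questions of subject s
    available[s]  maximum points obtainable on those same questions
    full_marks[s] full marks of subject s on the real paper (assumed values)
    """
    total = 0.0
    for subject, marks in full_marks.items():
        if available[subject] <= 0:
            raise ValueError(f"no points available for {subject}")
        scoring_rate = earned[subject] / available[subject]
        total += scoring_rate * marks
    return total


# Illustrative numbers only; none of them come from the paper.
full_marks = {"Chinese": 150, "Mathematics": 150, "English": 150}
earned = {"Chinese": 90, "Mathematics": 40, "English": 180}
available = {"Chinese": 120, "Mathematics": 100, "English": 200}
score = converted_total(earned, available, full_marks)
```

With these made-up inputs the scoring rates are 0.75, 0.40, and 0.90, which scale to 112.5 + 60 + 135 = 307.5 out of 450. Publishing a table of this kind alongside the formula would let readers check the comparison with official cut-offs.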
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to improve transparency around our evaluation protocol and human grading process.
- Referee: [Abstract / Evaluation] The headline claim that LLMs 'have achieved competitive scores' rests on converted total scores obtained via human evaluation, yet the manuscript supplies no sample sizes per subject, no precise scoring rubrics applied by graders, no inter-rater reliability statistics, and no statistical tests comparing model scores to human baselines. Without these, the competitiveness assertion cannot be verified.
  Authors: We agree that these details are necessary to substantiate the competitiveness claim. In the revised version we will add the number of questions evaluated per subject, reproduce the full scoring rubrics supplied to graders, report inter-rater reliability (e.g., Cohen’s kappa), and include basic statistical comparisons such as confidence intervals or significance tests against human baselines. These additions will appear in the Evaluation section and the appendix.
  Revision: yes
- Referee: [Evaluation / Human grading protocol] The central comparison of zero-shot LLM outputs to human Gaokao performance assumes that graders apply the same standards to concise, unpracticed model responses as to answers written by students who have studied the full curriculum. The paper does not report whether the rubric explicitly instructs evaluators to discount response fluency, length, or the absence of exam-specific strategies, which introduces a systematic risk that scores reflect surface features rather than subject mastery.
  Authors: This concern is valid. Our graders were instructed to apply standard Gaokao content-based rubrics, but the manuscript does not document this explicitly. We will revise the Evaluation section to quote the precise grading instructions, which direct evaluators to score factual accuracy, reasoning, and completeness while ignoring fluency, length, and exam-specific tactics. We will also note this as a methodological limitation.
  Revision: yes
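The inter-rater reliability statistic mentioned in the first response can be computed with a few lines of code. The following is a minimal sketch of unweighted Cohen's kappa for two graders who score the same answers, treating each distinct score as a category. The function name and input layout are assumptions for illustration and do not describe the authors' analysis.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance if each rater kept their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Because exam scores are ordered, a weighted variant that penalizes a one-point disagreement less than a five-point one may be a better fit. The unweighted form is shown here only because it is the simplest to verify by hand.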
Circularity Check
No circularity: direct empirical benchmark with no derivations or fitted predictions
This is a straightforward empirical evaluation paper that introduces GAOKAO-Bench, applies zero-shot prompting to LLMs, and reports human-graded scores across subjects. No equations, parameter fits, predictions, or self-citation chains are present that reduce any claimed result to quantities defined inside the paper. The central findings (competitive LLM scores with subject disparities) are obtained by direct measurement against an external exam, rendering the work self-contained with no internal circular reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Zero-shot prompting on exam questions produces responses whose quality can be scored in the same manner as human exam answers.
Forward citations
Cited by 18 Pith papers
- K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
  K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
- HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
  HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
- Validity-Calibrated Reasoning Distillation
  Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
- Validity-Calibrated Reasoning Distillation
  Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.
- TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
  TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
- RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)
  RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
  OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
  A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
- SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
  Disjoint SFT and GRPO data for autoformalization yields up to 10.4pp semantic accuracy gains over full overlap, which renders the GRPO stage redundant.
- Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
  MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.
- LLaDA2.0: Scaling Up Diffusion Language Models to 100B
  LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
  Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
- Yi: Open Foundation Models by 01.AI
  Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.