Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
Pith reviewed 2026-05-22 22:10 UTC · model grok-4.3
The pith
OlymMATH unifies numerical answer evaluation and Lean formal verification in a 350-problem Olympiad math benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OlymMATH is the first benchmark to unify dual evaluation paradigms within a single suite: natural language evaluation through OlymMATH-EASY and OlymMATH-HARD, comprising 200 computational problems with numerical answers for objective rule-based assessment, and formal verification through OlymMATH-LEAN, offering 150 problems formalized in Lean 4 for rigorous process-level evaluation.
What carries the argument
The OlymMATH benchmark with EASY and HARD subsets for natural language numerical evaluation plus the LEAN subset for formal verification in Lean 4.
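As a rough illustration of what "objective rule-based assessment" of a numerical answer can look like, the sketch below compares a model's final answer to a reference value exactly. It is not the authors' grader: the function names `parse_numeric` and `is_correct` are hypothetical, and it assumes answers are plain integers, decimals, or `a/b` fractions, which is narrower than real Olympiad answers.

```python
from fractions import Fraction


def parse_numeric(answer: str) -> Fraction:
    """Parse an integer, decimal, or 'a/b' string into an exact Fraction.

    Hypothetical helper: assumes the answer is a single rational number.
    """
    cleaned = answer.strip().replace(" ", "")
    return Fraction(cleaned)


def is_correct(predicted: str, reference: str) -> bool:
    """Rule-based check: exact equality after normalization.

    Any output that cannot be parsed as a number is scored as wrong.
    """
    try:
        return parse_numeric(predicted) == parse_numeric(reference)
    except (ValueError, ZeroDivisionError):
        return False
```

Exact rational comparison avoids floating-point tolerance choices, so `is_correct("0.75", "3/4")` holds while free-text output such as `"about 3"` is rejected. Answers involving radicals or symbolic constants would need a symbolic comparison instead.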
If this is right
- Models exhibit significant performance drops on this benchmark compared to prior ones.
- Consistent performance gaps appear between English and Chinese versions of the same problems.
- Analysis identifies cases where models employ heuristic guessing rather than rigorous reasoning.
- The released 582k reasoning trajectories and visualization tool enable further study of model behavior.
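The cross-language gap mentioned above can be measured by scoring the same problems in both languages and subtracting the accuracies. The sketch below assumes a hypothetical record layout of `(problem_id, language, correct)`; the toy records are invented for illustration, are not results from the paper, and the direction of the gap in them is arbitrary.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: fraction correct} from (problem_id, language, correct) records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _problem_id, language, correct in records:
        totals[language] += 1
        hits[language] += int(correct)
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Toy records invented for illustration; not data from the paper.
toy_records = [
    ("p1", "en", True), ("p1", "zh", True),
    ("p2", "en", True), ("p2", "zh", False),
    ("p3", "en", False), ("p3", "zh", False),
    ("p4", "en", True), ("p4", "zh", True),
]

acc = accuracy_by_language(toy_records)
gap = acc["en"] - acc["zh"]  # positive means higher accuracy on the English version
```

Because every problem has a parallel version in each language, the gap is computed over identical problem sets, which is what makes the comparison direct.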
Where Pith is reading between the lines
- The dual evaluation setup could be applied to other domains to test whether models perform genuine reasoning or pattern matching.
- Integration of formal verification tools with language models might become a standard way to improve proof reliability.
- If models show steady gains on OlymMATH over time, it would indicate measurable progress in handling complex mathematical tasks.
- Releasing parallel language versions allows direct study of how training data language affects reasoning consistency.
Load-bearing premise
Two assumptions carry the benchmark: that problems manually sourced from printed publications are absent from model training data, and that expert verification plus Lean formalization yield reliable test items that distinguish reasoning from guessing.
What would settle it
A model achieving high scores on the Lean subset while producing invalid or incomplete formal proofs would show the benchmark does not enforce process-level evaluation.
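To make "invalid or incomplete formal proofs" concrete, here is a toy Lean 4 sketch. Neither statement is taken from OlymMATH-LEAN, and the theorem names are hypothetical. The first proof is complete and accepted by the Lean kernel. The second compiles only with a warning, because `sorry` stands in for a missing proof, so a process-level grader would be expected to reject it.

```lean
-- Toy statements for illustration only; not problems from OlymMATH-LEAN.

-- Complete: the proof term is checked by Lean.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Incomplete: `sorry` is a placeholder, and Lean reports that the
-- declaration uses `sorry`. This should not count as a solved problem.
theorem toy_sum_sq_nonneg (a b : Int) : 0 ≤ a * a + b * b := by
  sorry
```

A grader that only checked whether the file compiles, without rejecting declarations that depend on `sorry`, would be exactly the failure mode described above.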
Original abstract
The rapid advancement of large reasoning models has saturated existing math benchmarks, underscoring the urgent need for more challenging evaluation frameworks. To address this, we introduce OlymMATH, a rigorously curated, Olympiad-level math benchmark comprising 350 problems, each with parallel English and Chinese versions. OlymMATH is the first benchmark to unify dual evaluation paradigms within a single suite: (1) natural language evaluation through OlymMATH-EASY and OlymMATH-HARD, comprising 200 computational problems with numerical answers for objective rule-based assessment, and (2) formal verification through OlymMATH-LEAN, offering 150 problems formalized in Lean 4 for rigorous process-level evaluation. All problems are manually sourced from printed publications to minimize data contamination, verified by experts, and span four core domains. Extensive experiments reveal the benchmark's significant challenge, and our analysis also uncovers consistent performance gaps between languages and identifies cases where models employ heuristic "guessing" rather than rigorous reasoning. To further support community research, we release 582k+ reasoning trajectories, a visualization tool, and expert solutions at https://github.com/RUCAIBox/OlymMATH.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OlymMATH, a benchmark of 350 Olympiad-level math problems with parallel English and Chinese versions. It unifies two evaluation paradigms in one suite: natural-language evaluation on OlymMATH-EASY and OlymMATH-HARD (200 computational problems with numerical answers for rule-based scoring) and formal verification on OlymMATH-LEAN (150 problems formalized in Lean 4 for process-level checking). All items were manually sourced from printed publications, expert-verified, and drawn from four core domains. Experiments are reported to demonstrate that the benchmark remains challenging for current models, to document consistent cross-language performance gaps, and to identify instances of heuristic guessing rather than rigorous reasoning. The authors additionally release 582k+ reasoning trajectories, a visualization tool, and expert solutions.
Significance. If the reported results and sourcing claims hold, OlymMATH supplies a timely, higher-difficulty resource that combines outcome-based and process-based evaluation within a single, contamination-resistant collection. The open release of trajectories and tooling directly supports reproducibility and follow-on analysis of reasoning failures. These elements address saturation in existing math benchmarks and provide a concrete testbed for distinguishing genuine reasoning from surface heuristics.
Minor comments (2)
- The abstract states that problems 'span four core domains' but does not enumerate them; the introduction or §2 should list the domains explicitly for clarity.
- The claim that OlymMATH is 'the first benchmark to unify dual evaluation paradigms' would be strengthened by a concise comparison table against the closest prior suites (e.g., those that separately offer Lean formalizations or numerical-answer sets).
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of OlymMATH and for recommending minor revision. The recognition that the benchmark addresses saturation in existing math evaluations through its bilingual design, dual evaluation paradigms, and contamination-resistant sourcing is appreciated. We note the recommendation and will incorporate any minor clarifications in the revised manuscript.
Circularity Check
No significant circularity
This is an empirical benchmark paper that introduces OlymMATH by manually sourcing 350 problems from printed publications, followed by expert verification and Lean formalization. No derivations, equations, fitted parameters, predictions, or self-citations are used to justify any result; performance gaps and contamination claims rest on external sourcing and testing rather than internal construction. The central claim of unifying evaluation paradigms is realized through the benchmark's construction itself and does not reduce to any self-referential step.
Forward citations
Cited by 13 Pith papers
- MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
  MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
- Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
  Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
- MathDuels: Evaluating LLMs as Problem Posers and Solvers
  Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
- Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
  NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
- Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism
  ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
- MathArena: Evaluating LLMs on Uncontaminated Math Competitions
  MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
- Unified Data Selection for LLM Reasoning
  High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
- Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
  Entrocraft uses rejection sampling to enforce custom entropy curves in LLM RL, sustaining longer training, better generalization, and higher output diversity than prior regularization approaches.
- Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
  Entrocraft uses rejection sampling to enforce precise entropy schedules in LLM RL by biasing advantages, enabling longer training, better generalization, and higher performance than baselines.
- OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
  OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
- TEMPO: Scaling Test-time Training for Large Reasoning Models
  TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
- PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
  PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.
- Riemann-Bench: A Benchmark for Moonshot Mathematics
  Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.
Appendix A
This part presents the detailed content of the dataset and the case study examples mentioned before.
Problem: Given that two vertices of an equilateral triangle are on the parabola y^2 = 4x, and the third vertex is on the directrix of the parabola, and the distance from the center of the triangle to the directrix...