Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

Haoxiang Sun; Ji-Rong Wen; Wayne Xin Zhao; Yingqian Min; Zhipeng Chen

arxiv: 2503.21380 · v3 · submitted 2025-03-27 · 💻 cs.CL

Pith reviewed 2026-05-22 22:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords olympiad math benchmark · large language models · mathematical reasoning evaluation · lean formalization · dual evaluation · data contamination prevention · numerical answer assessment

The pith

OlymMATH unifies numerical answer evaluation and Lean formal verification in a 350-problem Olympiad math benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OlymMATH, a benchmark for evaluating large language models on Olympiad-level math problems that are harder than those in existing benchmarks. It provides parallel English and Chinese versions of 350 problems drawn from printed publications. The benchmark splits into 200 problems for natural language evaluation based on numerical answers and 150 problems formalized in Lean 4 for checking the reasoning process. Experiments show that current models struggle on the benchmark, sometimes reach answers by heuristic guessing rather than rigorous reasoning, and perform differently across the two languages.

Core claim

OlymMATH is the first benchmark to unify dual evaluation paradigms within a single suite: natural language evaluation through OlymMATH-EASY and OlymMATH-HARD, comprising 200 computational problems with numerical answers for objective rule-based assessment, and formal verification through OlymMATH-LEAN, offering 150 problems formalized in Lean 4 for rigorous process-level evaluation.
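To make "objective rule-based assessment" of numerical answers concrete, here is a minimal sketch in Python. It is not the paper's checker: the function names `normalize` and `is_correct` are hypothetical, and the sketch only handles integers, decimals, and simple fractions, whereas a real Olympiad checker would presumably also need symbolic forms such as radicals.

```python
from fractions import Fraction
from typing import Optional


def normalize(answer: str) -> Optional[Fraction]:
    """Parse an integer, decimal, or 'a/b' string into an exact Fraction.

    Returns None when the string is not a plain rational number
    (hypothetical helper, not part of any OlymMATH tooling).
    """
    text = answer.strip().replace(" ", "").rstrip(".")
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(predicted: str, reference: str) -> bool:
    """Rule-based comparison: both sides must parse and be exactly equal."""
    p, r = normalize(predicted), normalize(reference)
    return p is not None and r is not None and p == r


if __name__ == "__main__":
    print(is_correct("0.75", "3/4"))     # equivalent forms of the same value
    print(is_correct("43", "42"))        # different values
    print(is_correct("sqrt(2)", "2"))    # unparseable answers never match
```

Comparing exact `Fraction` values rather than floats avoids rounding tolerance choices, at the cost of rejecting any answer the parser does not understand.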

What carries the argument

The OlymMATH benchmark with EASY and HARD subsets for natural language numerical evaluation plus the LEAN subset for formal verification in Lean 4.

If this is right

Models exhibit significant performance drops on this benchmark compared to prior ones.
Consistent performance gaps appear between English and Chinese versions of the same problems.
Analysis identifies cases where models employ heuristic guessing rather than rigorous reasoning.
The released 582k reasoning trajectories and visualization tool enable further study of model behavior.
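The language gap listed above can be measured from sampled trajectories along the following lines. This is a sketch under stated assumptions: pass@1 is estimated here as the per-problem fraction of correct samples averaged over problems, the function name `pass_at_1` is hypothetical, and the correctness flags are made-up illustration data, not results from the paper.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Average, over problems, of the fraction of sampled answers that are correct."""
    per_problem = [sum(flags) / len(flags) for flags in results.values()]
    return sum(per_problem) / len(per_problem)


# Illustrative correctness flags for the same two problems in both languages.
en = {"p1": [True, True, False, False], "p2": [True, True, True, True]}
zh = {"p1": [True, False, False, False], "p2": [True, True, True, False]}

# Positive gap means the model does better on the English versions.
gap = pass_at_1(en) - pass_at_1(zh)

if __name__ == "__main__":
    print(f"EN pass@1 = {pass_at_1(en):.2f}, ZH pass@1 = {pass_at_1(zh):.2f}, gap = {gap:.2f}")
```

Because the problems are parallel across languages, the gap is computed on identical items, so any difference reflects the language of the prompt rather than a difference in problem difficulty.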

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual evaluation setup could be applied to other domains to test whether models perform genuine reasoning or pattern matching.
Integration of formal verification tools with language models might become a standard way to improve proof reliability.
If models show steady gains on OlymMATH over time, it would indicate measurable progress in handling complex mathematical tasks.
Releasing parallel language versions allows direct study of how training data language affects reasoning consistency.

Load-bearing premise

The problems manually sourced from printed publications are absent from model training data, and expert verification plus Lean formalization produce reliable test items that distinguish reasoning from guessing.

What would settle it

A model achieving high scores on the Lean subset while producing invalid or incomplete formal proofs would show the benchmark does not enforce process-level evaluation.
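As a toy illustration of what "invalid or incomplete" means in Lean 4, consider the sketch below. The theorem is invented for illustration and is not taken from OlymMATH-LEAN; it assumes a recent Lean 4 toolchain in which the `omega` tactic is available without extra imports.

```lean
-- Toy statement, not an OlymMATH-LEAN item.
-- A complete proof: the kernel accepts it only because every step is justified.
theorem toy_sum (a b : Nat) (h : a = b) : a + b = 2 * b := by
  omega

-- An incomplete proof: Lean still elaborates this, but only with a warning
-- that the declaration uses `sorry`, so a grader must treat it as a failure.
theorem toy_sum_incomplete (a b : Nat) (h : a = b) : a + b = 2 * b := by
  sorry
```

The second declaration is the case the falsification test above is about: a scoring pipeline that counted it as solved would be rewarding an answer without a verified reasoning process.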

Figures

Figures reproduced from arXiv: 2503.21380 by Haoxiang Sun, Ji-Rong Wen, Wayne Xin Zhao, Yingqian Min, Zhipeng Chen.

**Figure 2.** Examples from the MATH dataset and our OlymMATH dataset.

**Figure 3.** Pass@1 accuracy on OlymMATH EN (y) vs. ZH (x); the dashed line shows parity. Points above favor English, below favor Chinese. Solid circles (local dense models, colored by size) indicate larger models trend towards higher accuracy. Hollow diamonds are MoE or API evaluated models.

**Figure 4.** Correlation of Pass@1 performance: OlymMATH-EN vs. AIME24. Dashed lines indicate linear trends per dataset. Solid shapes are local dense models (size = model size, color = release date). Hollow shapes denote MoE or API evaluated models. Stars mark the best overall model.

**Figure 5.** The OlymMATH-demo interface. It is currently being maintained on HuggingFace Spaces.

**Figure 6.** A geometry problem described precisely in text from OlymMATH.

**Figure 7.** An OlymMATH-HARD example testing model’s identification of all possible answers.

**Figure 8.** This boxplot shows that our EASY dataset has AIME-level difficulty with a wider distribu…

**Figure 9.** An example during our data collection. o3-mini (high) found the correct answer without…

**Figure 10.** An example from AIME 2025. o3-mini (high) forgot to prove that…

**Figure 11.** An example from Omni-MATH. The solution provided by Omni-MATH itself is flawed…

**Figure 12.** An example from OlymMATH-EN-HARD subset. o3-mini (high) attempted to “guess”…

Original abstract

The rapid advancement of large reasoning models has saturated existing math benchmarks, underscoring the urgent need for more challenging evaluation frameworks. To address this, we introduce OlymMATH, a rigorously curated, Olympiad-level math benchmark comprising 350 problems, each with parallel English and Chinese versions. OlymMATH is the first benchmark to unify dual evaluation paradigms within a single suite: (1) natural language evaluation through OlymMATH-EASY and OlymMATH-HARD, comprising 200 computational problems with numerical answers for objective rule-based assessment, and (2) formal verification through OlymMATH-LEAN, offering 150 problems formalized in Lean 4 for rigorous process-level evaluation. All problems are manually sourced from printed publications to minimize data contamination, verified by experts, and span four core domains. Extensive experiments reveal the benchmark's significant challenge, and our analysis also uncovers consistent performance gaps between languages and identifies cases where models employ heuristic "guessing" rather than rigorous reasoning. To further support community research, we release 582k+ reasoning trajectories, a visualization tool, and expert solutions at https://github.com/RUCAIBox/OlymMATH.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OlymMATH combines bilingual Olympiad problems with numerical answers and Lean verification in one suite, sourced from print to cut contamination risk.

OlymMATH stands out for putting 200 numerical-answer problems and 150 Lean 4 formalized problems into the same 350-problem bilingual set. That dual evaluation in one benchmark is the concrete step beyond earlier single-paradigm collections. Manual sourcing from printed publications plus expert checks is a practical move that directly targets data leakage, and releasing the trajectories and visualization tool gives others usable material right away. The split into EASY and HARD subsets plus coverage of four domains also looks reasonable for probing different reasoning depths.

The experimental side is thinner in what we have. The abstract flags big performance gaps and heuristic guessing, but the lack of detailed baselines, error bars, or contamination audits in the summary makes it hard to weigh how well those claims hold. Lean formalization adds rigor in principle, yet any translation slips could weaken the verification side, and the language-gap observations would benefit from more breakdown to show they track reasoning rather than surface effects.

This paper is aimed at groups building or testing LLM math systems who need harder, less contaminated items with both answer and proof checks. A reader focused on evaluation design would get direct value from the construction choices and released assets. It deserves a serious referee because the unified setup is worth checking in full even if the results section needs tightening.

Referee Report

0 major / 2 minor

Summary. The paper introduces OlymMATH, a benchmark of 350 Olympiad-level math problems with parallel English and Chinese versions. It unifies two evaluation paradigms in one suite: natural-language evaluation on OlymMATH-EASY and OlymMATH-HARD (200 computational problems with numerical answers for rule-based scoring) and formal verification on OlymMATH-LEAN (150 problems formalized in Lean 4 for process-level checking). All items were manually sourced from printed publications, expert-verified, and drawn from four core domains. Experiments are reported to demonstrate that the benchmark remains challenging for current models, to document consistent cross-language performance gaps, and to identify instances of heuristic guessing rather than rigorous reasoning. The authors additionally release 582k+ reasoning trajectories, a visualization tool, and expert solutions.

Significance. If the reported results and sourcing claims hold, OlymMATH supplies a timely, higher-difficulty resource that combines outcome-based and process-based evaluation within a single, contamination-resistant collection. The open release of trajectories and tooling directly supports reproducibility and follow-on analysis of reasoning failures. These elements address saturation in existing math benchmarks and provide a concrete testbed for distinguishing genuine reasoning from surface heuristics.

minor comments (2)

The abstract states that problems 'span four core domains' but does not enumerate them; the introduction or §2 should list the domains explicitly for clarity.
The claim that OlymMATH is 'the first benchmark to unify dual evaluation paradigms' would be strengthened by a concise comparison table against the closest prior suites (e.g., those that separately offer Lean formalizations or numerical-answer sets).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of OlymMATH and for recommending minor revision. The recognition that the benchmark addresses saturation in existing math evaluations through its bilingual design, dual evaluation paradigms, and contamination-resistant sourcing is appreciated. We note the recommendation and will incorporate any minor clarifications in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity

This is an empirical benchmark paper that introduces OlymMATH by manually sourcing 350 problems from printed publications, followed by expert verification and Lean formalization. No derivations, equations, fitted parameters, predictions, or self-citations are used to justify any result; performance gaps and contamination claims rest on external sourcing and testing rather than internal construction. The central claim of unifying evaluation paradigms is realized through the benchmark's construction itself and does not reduce to any self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper rather than a theoretical derivation; therefore the ledger contains no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5746 in / 1151 out tokens · 81455 ms · 2026-05-22T22:10:36.851920+00:00 · methodology

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
cs.AI 2026-04 accept novelty 8.0
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
cs.AI 2026-05 unverdicted novelty 7.0
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

MathDuels: Evaluating LLMs as Problem Posers and Solvers
cs.CL 2026-04 unverdicted novelty 7.0
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
cs.LG 2026-04 unverdicted novelty 7.0
NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism
cs.LO 2026-04 unverdicted novelty 7.0
ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

MathArena: Evaluating LLMs on Uncontaminated Math Competitions
cs.AI 2025-05 unverdicted novelty 7.0
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

Unified Data Selection for LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.

Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
cs.LG 2026-04 unverdicted novelty 6.0
Entrocraft uses rejection sampling to enforce custom entropy curves in LLM RL, sustaining longer training, better generalization, and higher output diversity than prior regularization approaches.

Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
cs.LG 2026-04 unverdicted novelty 6.0
Entrocraft uses rejection sampling to enforce precise entropy schedules in LLM RL by biasing advantages, enabling longer training, better generalization, and higher performance than baselines.

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
cs.CV 2026-04 unverdicted novelty 6.0
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

TEMPO: Scaling Test-time Training for Large Reasoning Models
cs.LG 2026-04 unverdicted novelty 6.0
TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.

PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
cs.LG 2026-04 unverdicted novelty 6.0
PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.

Riemann-Bench: A Benchmark for Moonshot Mathematics
cs.AI 2026-04 conditional novelty 5.0
Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 12 Pith papers · 5 internal anchors

[1] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2025.

[2] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X... Qwen2.5 Technical Report. arXiv, 2024.

[3] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany... arXiv, 2024.

[4] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D... arXiv, 2025.

[5] OpenAI. OpenAI o1 system card, 2024.

[6] Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning models, 2025.

[7] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024.

[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[9] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.

[10] Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, and Kai Zou. Mathodyssey: Benchmarking mathematical problem-solving skills in large language models using odyssey math data, 2024.

[11] Daman Arora, Himanshu Gaurav Singh, and Mausam. Have llms advanced enough? a challenging problem solving benchmark for large language models, 2023.

[12] OpenAI. OpenAI o3-mini: Pushing the frontier of cost-effective reasoning, January 2025.

[13] Google DeepMind. Gemini 2.5: Our most intelligent ai model, March 2025.

[14] Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, 2024.

[15] Mathematical Association of America. AIME 2024, 2024.

[16] Mathematical Association of America. AIME 2025, 2025.

[17] HMMT. HMMT 202502, 2025.

[18] Mathematical Association of America. USAMO 2025, 2025.

[19] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for C... 2024.

[20] Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985, 2024.

[21] RUCAIBox STILL Team. Still-3-1.5b-preview: Enhancing slow thinking abilities of small models through reinforcement learning. 2025.

[22] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog.

[23] Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025.

[24] Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng Zhang. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond, 2025.

[25] OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025.

[26] Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Yang Liu, and Yahui Zhou. Skywork open reasoner series, 2025. Notion Blog.

[27] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, ... Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.

[28] Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint, 2024.

[29] Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset, 2025.

[30] Qwen Team. Qwen3, April 2025.
