pith. machine review for the scientific record.

arxiv: 2410.07985 · v3 · submitted 2024-10-10 · 💻 cs.CL

Recognition: no theorem link

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords Olympiad mathematics · LLM benchmark · mathematical reasoning · competition problems · large language models · evaluation dataset

The pith

A new benchmark of 4428 Olympiad math problems shows even top models like o1-preview reach only 52.55% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Omni-MATH, a dataset of 4428 human-annotated competition-level mathematics problems drawn from Olympiads. The problems are organized into more than 33 sub-domains and over 10 difficulty levels, targeting reasoning beyond what current benchmarks can measure. Evaluation shows OpenAI o1-mini solving 60.54% and o1-preview solving 52.55% of the problems, while the older MATH benchmark is now solved at better than 94% accuracy. The results establish that existing evaluation sets no longer distinguish advanced models on truly difficult mathematics; Omni-MATH supplies a finer-grained tool for measuring progress in high-level mathematical reasoning.
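
To make the headline numbers concrete, here is a minimal sketch of the evaluation loop such a benchmark implies: run a model over each problem and tally accuracy overall and per difficulty tier. The JSONL layout, field names, and `query_model` stub are assumptions for illustration, not the paper's released format; exact-match scoring stands in for the paper's actual answer verification.

```python
import json
from collections import defaultdict

def query_model(problem: str) -> str:
    """Hypothetical model call; replace with a real API or local inference."""
    raise NotImplementedError

def evaluate(path: str) -> None:
    correct, total = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            item = json.loads(line)       # one problem per JSONL line (assumed)
            tier = item["difficulty"]     # assumed field name
            predicted = query_model(item["problem"])
            # Exact string match is a placeholder; Olympiad answers generally
            # need normalization or judged equivalence checking.
            hit = predicted.strip() == item["answer"].strip()
            correct[tier] += hit
            total[tier] += 1
    overall = sum(correct.values()) / sum(total.values())
    print(f"overall accuracy: {overall:.2%}")
    for tier in sorted(total):
        print(f"  difficulty {tier}: {correct[tier] / total[tier]:.2%}")
```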

Core claim

The central claim is that Omni-MATH provides a comprehensive Olympiad-level benchmark consisting of 4428 rigorously annotated problems across more than 33 sub-domains and 10+ difficulty tiers, and that state-of-the-art models including o1-mini and o1-preview still achieve only 60.54% and 52.55% accuracy respectively on these problems.

What carries the argument

The Omni-MATH dataset of 4428 human-annotated Olympiad problems, organized by sub-domain and difficulty level, which serves as the evaluation instrument for model performance.

Load-bearing premise

The 4428 problems constitute a fair, unbiased, and comprehensive sample of Olympiad-level mathematics with human annotation free of selection bias or verification errors.

What would settle it

A model achieving sustained accuracy above 90% across the full set, or independent verification revealing widespread errors in problem statements or ground-truth answers.
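
On the second possibility, a quick way to quantify "widespread errors" is a random audit with a binomial confidence interval. The sketch below, with made-up audit numbers, shows how a spot-check translates into a bound on the benchmark-wide annotation error rate.

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial error proportion."""
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical audit: 200 of the 4428 problems re-verified, 6 bad labels found.
lo, hi = wilson_interval(errors=6, n=200)
print(f"estimated annotation error rate, 95% CI: [{lo:.1%}, {hi:.1%}]")
# If the interval's upper bound is comparable to the 8-point gap between
# o1-mini and o1-preview, the headline accuracies carry real label noise.
```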

Original abstract

Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Omni-MATH, a benchmark of 4428 Olympiad-level mathematics competition problems with rigorous human annotation, spanning over 33 sub-domains and more than 10 difficulty levels. It evaluates multiple LLMs and reports that even the strongest models (OpenAI o1-mini at 60.54% and o1-preview at 52.55%) struggle, arguing that existing benchmarks like MATH are now saturated and insufficient for testing advanced reasoning.

Significance. If the dataset proves to be a clean, uncontaminated, and accurately annotated sample of Olympiad problems, the benchmark would be a valuable contribution. It directly addresses saturation in prior datasets (e.g., 94.8% on MATH) by providing scale, sub-domain coverage, and difficulty stratification that could support fine-grained diagnosis of LLM reasoning failures at the competition level.

major comments (1)
  1. [Dataset construction and annotation section] The manuscript asserts 'rigorous human annotation' and coverage of 33 sub-domains but supplies no explicit protocol for contest sourcing, deduplication against public training corpora, inter-annotator agreement statistics, independent double-checking of solutions, or quantitative estimates of annotation error rates. These details are load-bearing for the validity of the headline accuracies, because even modest contamination or label errors would render the 60.54% / 52.55% figures unreliable indicators of reasoning limits. (A minimal contamination screen is sketched below.)
minor comments (2)
  1. [Title] The title contains a grammatical error ('Mathematic Benchmark' should read 'Mathematics Benchmark').
  2. [Abstract] The abstract states 'more than 10 distinct difficulty levels' without describing the rubric or assignment procedure used to assign levels.
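
On the deduplication point in the major comment, a minimal contamination screen might look like the following: flag benchmark problems that share long character n-grams with a training corpus. The 50-character threshold and whitespace normalization are illustrative assumptions, not the paper's protocol.

```python
def char_ngrams(text: str, n: int = 50) -> set:
    """Overlapping character n-grams of a whitespace-normalized string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 0))}

def flag_contaminated(problems: list, corpus_docs: list, n: int = 50) -> list:
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= char_ngrams(doc, n)
    # Any shared 50-character span suggests verbatim leakage and should
    # trigger manual review before the problem is kept in the benchmark.
    return [i for i, p in enumerate(problems) if char_ngrams(p, n) & corpus_grams]
```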

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We agree that explicit documentation of the dataset construction and annotation process is essential to substantiate the benchmark's validity and the reliability of the reported model accuracies. We address the major comment below and will incorporate the necessary revisions to strengthen the paper.

Point-by-point responses
  1. Referee: The manuscript asserts 'rigorous human annotation' and coverage of 33 sub-domains but supplies no explicit protocol for contest sourcing, deduplication against public training corpora, inter-annotator agreement statistics, independent double-checking of solutions, or quantitative estimates of annotation error rates. These details are load-bearing for the validity of the headline accuracies, because even modest contamination or label errors would render the 60.54% / 52.55% figures unreliable indicators of reasoning limits.

    Authors: We agree that the current version of the manuscript lacks sufficient detail on these critical aspects of dataset construction. In the revised manuscript, we will expand the relevant section to explicitly describe: (1) the sourcing protocol, including the specific Olympiad contests, platforms, and selection criteria used to compile the 4428 problems; (2) the deduplication process against public training corpora, incorporating automated similarity detection, manual review, and exclusion of any overlapping items; (3) inter-annotator agreement statistics (e.g., percentage agreement and Cohen's kappa) from the multiple human annotators involved; (4) the independent double-checking protocol, where solutions were verified by additional expert mathematicians; and (5) quantitative estimates of annotation error rates based on spot-checks of a validation subset. These additions will provide the transparency needed to support the headline results and allow for a more rigorous evaluation of potential contamination or labeling issues. revision: yes
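
The agreement statistics the rebuttal promises are simple to state precisely. A hedged pure-Python sketch of percent agreement and Cohen's kappa for two annotators labeling the same problems, with made-up difficulty labels:

```python
from collections import Counter

def agreement_stats(a: list, b: list) -> tuple:
    """Percent agreement and Cohen's kappa for two annotators' labels."""
    assert len(a) == len(b) and a, "need two equal-length, non-empty label lists"
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n      # raw percent agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: both annotators independently pick the same label
    # at their own marginal rates.
    p_exp = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / n**2
    return p_obs, (p_obs - p_exp) / (1 - p_exp)

# e.g. two annotators assigning difficulty tiers to five problems:
p, kappa = agreement_stats(["4", "4", "5", "6", "4"], ["4", "5", "5", "6", "4"])
print(f"agreement {p:.0%}, kappa {kappa:.2f}")   # agreement 80%, kappa 0.69
```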

Circularity Check

0 steps flagged

No circularity; empirical benchmark with direct measurements

Full rationale

This is a data-collection and evaluation paper that introduces 4428 annotated Olympiad problems and reports model accuracies on them. There are no derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citation chains support load-bearing claims, and results are straightforward empirical measurements on held-out items. The derivation chain is empty by nature of the work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on expert curation of existing competition problems rather than new mathematical derivations or invented constructs.

axioms (1)
  • domain assumption: Human experts can reliably identify and verify Olympiad-level problems without introducing selection bias.
    The benchmark's validity depends entirely on the quality and representativeness of the human-annotated collection.

pith-pipeline@v0.9.0 · 5573 in / 1238 out tokens · 55134 ms · 2026-05-15T09:04:10.133306+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

    cs.AI 2026-04 accept novelty 8.0

    MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

  2. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

  3. Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

    cs.LG 2026-05 unverdicted novelty 7.0

    UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.

  4. OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 7.0

    OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...

  5. Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

    cs.LO 2026-04 unverdicted novelty 7.0

    ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

  6. MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    cs.AI 2025-05 unverdicted novelty 7.0

    MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

  7. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  8. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  9. Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.

  10. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.

  11. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  12. Controllable and Verifiable Process Data Synthesis for Process Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.

  13. Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.

  14. OLLM: Options-based Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    OLLM models next-token generation as a latent-indexed set of options, enabling up to 70% math reasoning correctness versus 51% baselines and structure-based alignment via a compact latent policy.

  15. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  16. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  17. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  18. Riemann-Bench: A Benchmark for Moonshot Mathematics

    cs.AI 2026-04 conditional novelty 5.0

    Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.

  19. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 19 Pith papers · 12 internal anchors
