pith. machine review for the scientific record.

arxiv: 2410.07985 · v3 · submitted 2024-10-10 · 💻 cs.CL

Recognition: no theorem link

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords Olympiad mathematics · LLM benchmark · mathematical reasoning · competition problems · large language models · evaluation dataset

The pith

A new benchmark of 4428 Olympiad math problems shows even top models like o1-preview reach only 52.55% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Omni-MATH, a dataset of 4428 human-annotated competition-level mathematics problems drawn from Olympiads. The problems are organized into more than 33 sub-domains and over 10 difficulty levels, targeting reasoning beyond what current benchmarks can measure. Evaluation shows OpenAI o1-mini solving 60.54% and o1-preview solving 52.55% of the problems, while the older MATH benchmark is now solved at better than 94% accuracy. The results establish that existing evaluation sets no longer distinguish advanced models on truly difficult mathematics; Omni-MATH supplies a finer-grained tool for measuring progress in high-level mathematical reasoning.
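
To make the headline numbers concrete, here is a minimal sketch of the evaluation loop such a benchmark implies: run a model over each problem and tally accuracy overall and per difficulty tier. The JSONL layout, field names, and `query_model` stub are assumptions for illustration, not the paper's released format; exact-match scoring stands in for the paper's actual answer verification.

```python
import json
from collections import defaultdict

def query_model(problem: str) -> str:
    """Hypothetical model call; replace with a real API or local inference."""
    raise NotImplementedError

def evaluate(path: str) -> None:
    correct, total = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            item = json.loads(line)       # one problem per JSONL line (assumed)
            tier = item["difficulty"]     # assumed field name
            predicted = query_model(item["problem"])
            # Exact string match is a placeholder; Olympiad answers generally
            # need normalization or judged equivalence checking.
            hit = predicted.strip() == item["answer"].strip()
            correct[tier] += hit
            total[tier] += 1
    overall = sum(correct.values()) / sum(total.values())
    print(f"overall accuracy: {overall:.2%}")
    for tier in sorted(total):
        print(f"  difficulty {tier}: {correct[tier] / total[tier]:.2%}")
```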

Core claim

The central claim is that Omni-MATH provides a comprehensive Olympiad-level benchmark consisting of 4428 rigorously annotated problems across more than 33 sub-domains and 10+ difficulty tiers, and that state-of-the-art models including o1-mini and o1-preview still achieve only 60.54% and 52.55% accuracy respectively on these problems.

What carries the argument

The Omni-MATH dataset of 4428 human-annotated Olympiad problems, organized by sub-domain and difficulty level, which serves as the evaluation instrument for model performance.

Load-bearing premise

The 4428 problems constitute a fair, unbiased, and comprehensive sample of Olympiad-level mathematics with human annotation free of selection bias or verification errors.

What would settle it

A model achieving sustained accuracy above 90% across the full set, or independent verification revealing widespread errors in problem statements or ground-truth answers.
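
On the second possibility, a quick way to quantify "widespread errors" is a random audit with a binomial confidence interval. The sketch below, with made-up audit numbers, shows how a spot-check translates into a bound on the benchmark-wide annotation error rate.

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial error proportion."""
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical audit: 200 of the 4428 problems re-verified, 6 bad labels found.
lo, hi = wilson_interval(errors=6, n=200)
print(f"estimated annotation error rate, 95% CI: [{lo:.1%}, {hi:.1%}]")
# If the interval's upper bound is comparable to the 8-point gap between
# o1-mini and o1-preview, the headline accuracies carry real label noise.
```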

Original abstract

Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Omni-MATH, a benchmark of 4428 Olympiad-level mathematics competition problems with rigorous human annotation, spanning over 33 sub-domains and more than 10 difficulty levels. It evaluates multiple LLMs and reports that even the strongest models (OpenAI o1-mini at 60.54% and o1-preview at 52.55%) struggle, arguing that existing benchmarks like MATH are now saturated and insufficient for testing advanced reasoning.

Significance. If the dataset proves to be a clean, uncontaminated, and accurately annotated sample of Olympiad problems, the benchmark would be a valuable contribution. It directly addresses saturation in prior datasets (e.g., 94.8% on MATH) by providing scale, sub-domain coverage, and difficulty stratification that could support fine-grained diagnosis of LLM reasoning failures at the competition level.

major comments (1)
  1. [Dataset construction and annotation section] The manuscript asserts 'rigorous human annotation' and coverage of 33 sub-domains but supplies no explicit protocol for contest sourcing, deduplication against public training corpora, inter-annotator agreement statistics, independent double-checking of solutions, or quantitative estimates of annotation error rates. These details are load-bearing for the validity of the headline accuracies, because even modest contamination or label errors would render the 60.54% / 52.55% figures unreliable indicators of reasoning limits. (A minimal contamination screen is sketched below.)
minor comments (2)
  1. [Title] The title contains a grammatical error ('Mathematic Benchmark' should read 'Mathematics Benchmark').
  2. [Abstract] The abstract states 'more than 10 distinct difficulty levels' without describing the rubric or assignment procedure used to assign levels.
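
On the deduplication point in the major comment, a minimal contamination screen might look like the following: flag benchmark problems that share long character n-grams with a training corpus. The 50-character threshold and whitespace normalization are illustrative assumptions, not the paper's protocol.

```python
def char_ngrams(text: str, n: int = 50) -> set:
    """Overlapping character n-grams of a whitespace-normalized string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 0))}

def flag_contaminated(problems: list, corpus_docs: list, n: int = 50) -> list:
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= char_ngrams(doc, n)
    # Any shared 50-character span suggests verbatim leakage and should
    # trigger manual review before the problem is kept in the benchmark.
    return [i for i, p in enumerate(problems) if char_ngrams(p, n) & corpus_grams]
```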

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We agree that explicit documentation of the dataset construction and annotation process is essential to substantiate the benchmark's validity and the reliability of the reported model accuracies. We address the major comment below and will incorporate the necessary revisions to strengthen the paper.

Point-by-point responses
  1. Referee: The manuscript asserts 'rigorous human annotation' and coverage of 33 sub-domains but supplies no explicit protocol for contest sourcing, deduplication against public training corpora, inter-annotator agreement statistics, independent double-checking of solutions, or quantitative estimates of annotation error rates. These details are load-bearing for the validity of the headline accuracies, because even modest contamination or label errors would render the 60.54% / 52.55% figures unreliable indicators of reasoning limits.

    Authors: We agree that the current version of the manuscript lacks sufficient detail on these critical aspects of dataset construction. In the revised manuscript, we will expand the relevant section to explicitly describe: (1) the sourcing protocol, including the specific Olympiad contests, platforms, and selection criteria used to compile the 4428 problems; (2) the deduplication process against public training corpora, incorporating automated similarity detection, manual review, and exclusion of any overlapping items; (3) inter-annotator agreement statistics (e.g., percentage agreement and Cohen's kappa) from the multiple human annotators involved; (4) the independent double-checking protocol, where solutions were verified by additional expert mathematicians; and (5) quantitative estimates of annotation error rates based on spot-checks of a validation subset. These additions will provide the transparency needed to support the headline results and allow for a more rigorous evaluation of potential contamination or labeling issues. revision: yes
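
The agreement statistics the rebuttal promises are simple to state precisely. A hedged pure-Python sketch of percent agreement and Cohen's kappa for two annotators labeling the same problems, with made-up difficulty labels:

```python
from collections import Counter

def agreement_stats(a: list, b: list) -> tuple:
    """Percent agreement and Cohen's kappa for two annotators' labels."""
    assert len(a) == len(b) and a, "need two equal-length, non-empty label lists"
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n      # raw percent agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: both annotators independently pick the same label
    # at their own marginal rates.
    p_exp = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / n**2
    return p_obs, (p_obs - p_exp) / (1 - p_exp)

# e.g. two annotators assigning difficulty tiers to five problems:
p, kappa = agreement_stats(["4", "4", "5", "6", "4"], ["4", "5", "5", "6", "4"])
print(f"agreement {p:.0%}, kappa {kappa:.2f}")   # agreement 80%, kappa 0.69
```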

Circularity Check

0 steps flagged

No circularity; empirical benchmark with direct measurements

Full rationale

This is a data-collection and evaluation paper that introduces 4428 annotated Olympiad problems and reports model accuracies on them. There are no derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citation chains support load-bearing claims, and results are straightforward empirical measurements on held-out items. The derivation chain is empty by nature of the work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on expert curation of existing competition problems rather than new mathematical derivations or invented constructs.

axioms (1)
  • domain assumption: Human experts can reliably identify and verify Olympiad-level problems without introducing selection bias.
    The benchmark's validity depends entirely on the quality and representativeness of the human-annotated collection.

pith-pipeline@v0.9.0 · 5573 in / 1238 out tokens · 55134 ms · 2026-05-15T09:04:10.133306+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

    cs.AI 2026-04 accept novelty 8.0

    MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

  2. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

  3. Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

    cs.LG 2026-05 unverdicted novelty 7.0

    UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.

  4. OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 7.0

    OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...

  5. Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

    cs.LO 2026-04 unverdicted novelty 7.0

    ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

  6. MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    cs.AI 2025-05 unverdicted novelty 7.0

    MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

  7. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  8. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  9. Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.

  10. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.

  11. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  12. Controllable and Verifiable Process Data Synthesis for Process Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.

  13. Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.

  14. OLLM: Options-based Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    OLLM models next-token generation as a latent-indexed set of options, enabling up to 70% math reasoning correctness versus 51% baselines and structure-based alignment via a compact latent policy.

  15. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  16. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  17. Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

    cs.AI 2026-04 unverdicted novelty 5.0

    An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

  18. Riemann-Bench: A Benchmark for Moonshot Mathematics

    cs.AI 2026-04 conditional novelty 5.0

    Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.

  19. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 19 Pith papers · 12 internal anchors
