Recognition: no theorem link
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Pith reviewed 2026-05-15 09:04 UTC · model grok-4.3
The pith
A new benchmark of 4428 Olympiad math problems shows that even top models like OpenAI o1-preview reach only 52.55% accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Omni-MATH provides a comprehensive Olympiad-level benchmark consisting of 4428 rigorously annotated problems across more than 33 sub-domains and 10+ difficulty tiers, and that state-of-the-art models including o1-mini and o1-preview still achieve only 60.54% and 52.55% accuracy respectively on these problems.
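A back-of-the-envelope check, assuming the 4428 problems are scored independently under a simple binomial model (an assumption made here for illustration, not stated in the paper), puts the sampling noise on the headline figure well below the gap to saturation:

```latex
% Illustrative only; not from the paper. Treat each of the 4428 problems as an
% independent Bernoulli trial with success probability p = 0.5255 (o1-preview).
\[
n\hat{p} \approx 4428 \times 0.5255 \approx 2327 \ \text{problems solved},
\]
\[
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
  = \sqrt{\frac{0.5255 \times 0.4745}{4428}} \approx 0.0075,
\qquad
95\%\ \text{CI} \approx 52.55\% \pm 1.5\ \text{points}.
\]
```

So the distance between 52.55% and, say, 90% dwarfs statistical noise; what could move the number materially is label error or contamination, which is where the referee's major comment lands.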
What carries the argument
The Omni-MATH dataset of 4428 human-annotated Olympiad problems, organized by sub-domain and difficulty level, which serves as the evaluation instrument for model performance.
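As a hedged illustration of how the dataset functions as an evaluation instrument, a minimal accuracy loop follows; `query_model`, `answers_match`, and the JSONL field names are hypothetical placeholders, since the paper's grading pipeline is not described in this excerpt.

```python
# Minimal sketch of an accuracy run over a benchmark like Omni-MATH.
# `query_model` and `answers_match` are hypothetical stand-ins; the actual
# grading procedure (e.g., how free-form answers are normalized) is not
# specified here.
import json

def evaluate(problems_path: str, query_model, answers_match) -> float:
    with open(problems_path) as f:
        problems = [json.loads(line) for line in f]  # one problem per line

    correct = 0
    for item in problems:
        prediction = query_model(item["problem"])      # model's final answer
        if answers_match(prediction, item["answer"]):  # ground-truth comparison
            correct += 1
    return correct / len(problems)

# Example: accuracy = evaluate("omni_math.jsonl", my_model, my_grader)
```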
Load-bearing premise
The 4428 problems constitute a fair, unbiased, and comprehensive sample of Olympiad-level mathematics with human annotation free of selection bias or verification errors.
What would settle it
A model achieving sustained accuracy above 90% across the full set, or independent verification revealing widespread errors in problem statements or ground-truth answers.
read the original abstract
Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Omni-MATH, a benchmark of 4428 Olympiad-level mathematics competition problems with rigorous human annotation, spanning over 33 sub-domains and more than 10 difficulty levels. It evaluates multiple LLMs and reports that even the strongest models (OpenAI o1-mini at 60.54% and o1-preview at 52.55%) struggle, arguing that existing benchmarks like MATH are now saturated and insufficient for testing advanced reasoning.
Significance. If the dataset proves to be a clean, uncontaminated, and accurately annotated sample of Olympiad problems, the benchmark would be a valuable contribution. It directly addresses saturation in prior datasets (e.g., 94.8% on MATH) by providing scale, sub-domain coverage, and difficulty stratification that could support fine-grained diagnosis of LLM reasoning failures at the competition level.
major comments (1)
- [Dataset construction and annotation section] The manuscript asserts 'rigorous human annotation' and coverage of 33 sub-domains but supplies no explicit protocol for contest sourcing, deduplication against public training corpora, inter-annotator agreement statistics, independent double-checking of solutions, or quantitative estimates of annotation error rates. These details are load-bearing for the validity of the headline accuracies, because even modest contamination or label errors would render the 60.54% / 52.55% figures unreliable indicators of reasoning limits.
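One concrete shape such a deduplication check could take, as a minimal sketch and not the authors' pipeline (corpus access, the tokenization, and the 13-gram threshold are all illustrative assumptions):

```python
# Sketch of an n-gram-overlap contamination check between candidate benchmark
# problems and a public corpus. Thresholds and tokenization are illustrative
# assumptions, not the authors' protocol.
import re

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(problems: list[str], corpus_docs: list[str],
                      n: int = 13) -> list[int]:
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    # Flag any problem sharing at least one long n-gram with the corpus.
    return [i for i, p in enumerate(problems) if ngrams(p, n) & corpus_grams]
```

Long verbatim n-gram matches are only a conservative contamination signal; paraphrased or translated duplicates would still require fuzzy matching or manual review.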
minor comments (2)
- [Title] The title contains a grammatical error ('Mathematic Benchmark' should read 'Mathematics Benchmark').
- [Abstract] The abstract states 'more than 10 distinct difficulty levels' without describing the rubric or procedure used to assign those levels.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We agree that explicit documentation of the dataset construction and annotation process is essential to substantiate the benchmark's validity and the reliability of the reported model accuracies. We address the major comment below and will incorporate the necessary revisions to strengthen the paper.
read point-by-point responses
Referee: The manuscript asserts 'rigorous human annotation' and coverage of 33 sub-domains but supplies no explicit protocol for contest sourcing, deduplication against public training corpora, inter-annotator agreement statistics, independent double-checking of solutions, or quantitative estimates of annotation error rates. These details are load-bearing for the validity of the headline accuracies, because even modest contamination or label errors would render the 60.54% / 52.55% figures unreliable indicators of reasoning limits.
Authors: We agree that the current version of the manuscript lacks sufficient detail on these critical aspects of dataset construction. In the revised manuscript, we will expand the relevant section to explicitly describe: (1) the sourcing protocol, including the specific Olympiad contests, platforms, and selection criteria used to compile the 4428 problems; (2) the deduplication process against public training corpora, incorporating automated similarity detection, manual review, and exclusion of any overlapping items; (3) inter-annotator agreement statistics (e.g., percentage agreement and Cohen's kappa) from the multiple human annotators involved; (4) the independent double-checking protocol, where solutions were verified by additional expert mathematicians; and (5) quantitative estimates of annotation error rates based on spot-checks of a validation subset. These additions will provide the transparency needed to support the headline results and allow for a more rigorous evaluation of potential contamination or labeling issues.
Revision: yes
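For reference, the agreement statistics the rebuttal promises are straightforward to compute; a minimal sketch, with illustrative labels rather than the authors' annotation scheme:

```python
# Minimal sketch of the inter-annotator agreement statistics the rebuttal
# proposes to report: raw percentage agreement and Cohen's kappa for two
# annotators labeling the same problems. Labels shown are illustrative.
from collections import Counter

def percent_agreement(a: list[str], b: list[str]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[str], b: list[str]) -> float:
    p_o = percent_agreement(a, b)                         # observed agreement
    ca, cb, n = Counter(a), Counter(b), len(a)
    labels = set(a) | set(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Example with hypothetical difficulty-tier labels from two annotators:
# cohens_kappa(["T4", "T5", "T4"], ["T4", "T5", "T5"])  # -> 0.4
```

Kappa discounts the agreement expected by chance, which matters when sub-domain or difficulty labels are imbalanced.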
Circularity Check
No circularity; empirical benchmark with direct measurements
full rationale
This is a data-collection and evaluation paper that introduces 4428 annotated Olympiad problems and reports model accuracies on them. There are no derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citation chains support load-bearing claims, and results are straightforward empirical measurements on held-out items. The derivation chain is empty by nature of the work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human experts can reliably identify and verify Olympiad-level problems without introducing selection bias.
Forward citations
Cited by 19 Pith papers
- MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
  MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
- Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
  Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
- Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
  UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
- OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
  OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...
- Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism
  ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
- MathArena: Evaluating LLMs on Uncontaminated Math Competitions
  MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
- Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
  BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
- TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
  TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
- Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
  A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
- Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
  A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
- You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
  NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
- Controllable and Verifiable Process Data Synthesis for Process Reward Models
  A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.
- Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
  DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.
- OLLM: Options-based Large Language Models
  OLLM models next-token generation as a latent-indexed set of options, enabling up to 70% math reasoning correctness versus 51% baselines and structure-based alignment via a compact latent policy.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- LLaDA2.0: Scaling Up Diffusion Language Models to 100B
  LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
- Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
  An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.
- Riemann-Bench: A Benchmark for Moonshot Mathematics
  Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.
- Humanity's Last Exam
  Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.