pith. machine review for the scientific record.

arxiv: 2505.23281 · v3 · submitted 2025-05-29 · 💻 cs.AI · cs.CL

Recognition: 1 theorem link

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:05 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM evaluation · mathematical reasoning · benchmark contamination · proof writing · math competitions · IMO · AIME

The pith

MathArena evaluates LLMs on math competition problems released after their training data cutoffs to eliminate contamination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MathArena, a benchmark that draws problems from recurring math competitions and evaluates models in real time, as soon as new problems are released. This setup tests whether models can solve fresh, high-quality problems rather than recall material from common online datasets. Evaluations on contests such as CMIMC 2025 show strong reasoning in top models, while proof-writing tasks on IMO 2025 yield scores below 40 percent. The work also detects clear signs of contamination in widely used benchmarks like AIME 2024. A sympathetic reader would care because the method supplies an ongoing, verifiable way to measure genuine mathematical progress instead of inflated scores from memorized examples.

Core claim

MathArena is a benchmark that uses problems from recurring math competitions released after model training cutoffs. This produces contamination-free evaluations across more than 50 models and 162 problems from seven contests. Results show contamination in AIME 2024, strong reasoning on harder contests such as CMIMC 2025, and a clear gap in proof-writing with top models scoring slightly less than 40 percent on IMO 2025.

What carries the argument

The central mechanism is real-time evaluation on newly released problems from recurring competitions, which supplies a continuous stream of fresh test items.
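A minimal sketch of that release-date gate, assuming hypothetical model names, cutoff dates, and `query_model`/`grade` helpers; none of these names come from the paper, and MathArena's actual pipeline will differ:

```python
from datetime import date

# Hypothetical cutoffs; MathArena's real model list and dates differ.
MODEL_CUTOFFS = {"model-a": date(2024, 10, 1), "model-b": date(2025, 1, 15)}

def is_fresh(released: date, cutoff: date) -> bool:
    # The load-bearing gate: a problem counts toward a model's score only
    # if it was released strictly after that model's training cutoff.
    return released > cutoff

def evaluate_contest(problems, query_model, grade):
    """problems: (statement, reference_answer, release_date) tuples.
    query_model and grade are assumed helpers, not MathArena's API."""
    scores = {}
    for model, cutoff in MODEL_CUTOFFS.items():
        eligible = [p for p in problems if is_fresh(p[2], cutoff)]
        if eligible:
            correct = sum(grade(query_model(model, stmt), ans)
                          for stmt, ans, _ in eligible)
            scores[model] = correct / len(eligible)
    return scores
```

Because the contest calendar keeps producing problems, the same loop can be rerun on each new release, which is what makes the benchmark evolving rather than static.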

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other fields that release regular high-quality challenges, such as programming or physics contests.
  • Models may need targeted training on formal proof structures to close the observed gap.
  • Ongoing updates to the benchmark could become a standard practice for keeping AI evaluations current and fair.

Load-bearing premise

Newly released competition problems have never appeared in any training corpus or web scrape used by the evaluated models.

What would settle it

Locating any of the 2025 contest problems used in MathArena inside the training data of a top-performing model would disprove the contamination-free claim.
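The paper itself reports no such search, but a standard first pass would be a verbatim n-gram scan over candidate pre-cutoff documents; a minimal sketch, with corpus access assumed and the flagging threshold purely illustrative:

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(problem: str, document: str, n: int = 8) -> float:
    """Fraction of a problem's word n-grams found verbatim in one document."""
    probe = ngrams(problem, n)
    return len(probe & ngrams(document, n)) / len(probe) if probe else 0.0

# Run over every pre-cutoff document reachable in a model's training snapshot;
# a document with, say, overlap_fraction > 0.3 at n = 8 would be the kind of
# verbatim leak that falsifies the claim. The threshold is illustrative.
```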

read the original abstract

The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On IMO 2025, top models achieve slightly less than 40%, demonstrating both notable progress and significant room for improvement. So far, we have evaluated over 50 models across seven competitions, totaling 162 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MathArena, a benchmark that evaluates LLMs on newly released problems from math competitions (AIME, CMIMC, IMO, etc.) to avoid contamination from training data. It reports strong evidence of contamination in AIME 2024, impressive reasoning on harder contests such as CMIMC 2025, and the first systematic results on proof-writing, with top models scoring slightly below 40% on IMO 2025. Over 50 models were tested on 162 problems total, with the benchmark positioned as an evolving, real-time evaluation framework.

Significance. If the no-contamination premise is substantiated, MathArena supplies a valuable, extensible resource for measuring genuine generalization in LLM mathematical reasoning, especially proof generation, which existing benchmarks largely omit. The empirical contrast between contaminated and post-cutoff contests, together with the ongoing release pipeline, could set a precedent for contamination-resistant evaluation in AI.

major comments (3)
  1. [Introduction and §3] The central methodological claim that immediate post-release evaluation eliminates contamination risk is load-bearing yet rests on an unverified assumption; no archive searches, web-probe experiments, or training-data overlap checks are described for the 162 problems, leaving open the possibility that unofficial leaks or forum posts reached training corpora before model cutoffs.
  2. [§4.3] The reported top-model score of slightly less than 40% on IMO 2025 proof-writing lacks detail on the grading protocol, including rubric, whether grading was automated or human, number of graders, and inter-rater agreement; without these, the quantitative claim cannot be assessed for reliability.
  3. [§4.1] The procedure used to identify 'strong signs of contamination' in AIME 2024, including its exact metrics and thresholds, is not specified, preventing readers from determining whether analogous undetected leakage could affect the harder-contest results.
minor comments (2)
  1. [Table 1] A consolidated table listing all seven competitions, their release dates, and problem counts would improve readability and allow quick cross-reference with the reported scores.
  2. [Throughout] Model names and abbreviations are introduced inconsistently; a single nomenclature table or footnote list would reduce ambiguity across figures and tables.
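Major comment 1 mentions web-probe experiments; one concrete form such a probe could take is a verbatim-completion test, where the model is prompted with the opening of a problem statement and scored on how faithfully it reproduces the withheld remainder. A minimal sketch, reusing the hypothetical `query_model` helper from the earlier sketch; the paper describes no such probe:

```python
from difflib import SequenceMatcher

def completion_probe(query_model, model: str, statement: str,
                     frac: float = 0.5) -> float:
    """Prompt with the opening of a problem statement and score how closely
    the model reproduces the withheld remainder (1.0 = verbatim recall)."""
    cut = int(len(statement) * frac)
    prefix, held_out = statement[:cut], statement[cut:]
    prompt = "Continue this competition problem verbatim:\n" + prefix
    continuation = query_model(model, prompt)[:len(held_out)]
    return SequenceMatcher(None, continuation, held_out).ratio()

# Near-1.0 ratios on pre-cutoff problems, but not on post-cutoff ones, would
# indicate memorization rather than reasoning.
```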

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications and indicating where revisions will be made to improve the manuscript.

read point-by-point responses
  1. Referee: [Introduction and §3] The central methodological claim that immediate post-release evaluation eliminates contamination risk is load-bearing yet rests on an unverified assumption; no archive searches, web-probe experiments, or training-data overlap checks are described for the 162 problems, leaving open the possibility that unofficial leaks or forum posts reached training corpora before model cutoffs.

    Authors: We agree that the manuscript would benefit from greater transparency on this point. While immediate post-release evaluation inherently limits the opportunity for contamination relative to static benchmarks, we recognize that unofficial leaks remain a theoretical possibility. In the revised version we will expand the Introduction and §3 to include explicit timelines of each contest's official release dates, our evaluation dates, and any checks we performed for public availability on official sites and major forums before model cutoffs. We will also add a limitations paragraph acknowledging that absolute verification of training-data absence is infeasible and explaining why the approach still offers stronger protection than existing benchmarks. revision: partial

  2. Referee: [§4.3] The reported top-model score of slightly less than 40% on IMO 2025 proof-writing lacks detail on the grading protocol, including rubric, whether grading was automated or human, number of graders, and inter-rater agreement; without these, the quantitative claim cannot be assessed for reliability.

    Authors: We accept this criticism and will substantially expand §4.3. The revised text will state that all proofs were graded manually by expert mathematicians using a rubric adapted from official IMO scoring guidelines, with emphasis on mathematical correctness, completeness, and clarity. Grading was performed independently by two graders, with a third expert resolving any disagreements; we will report the resulting inter-rater agreement. The full rubric will be included in the appendix. revision: yes

  3. Referee: [§4.1] The procedure used to identify 'strong signs of contamination' in AIME 2024, including its exact metrics and thresholds, is not specified, preventing readers from determining whether analogous undetected leakage could affect the harder-contest results.

    Authors: We will revise §4.1 to describe the detection procedure in full. The method compared model accuracy on AIME 2024 against expected performance derived from similar problems in prior uncontaminated contests, using quantitative metrics such as accuracy deviation and qualitative inspection of solution patterns for signs of memorization. Thresholds were defined via statistical outliers relative to baseline models. The revised section will specify the exact metrics and thresholds so readers can evaluate the strength of the evidence and apply analogous reasoning to other contests. revision: yes
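Two of the promised revisions lend themselves to compact illustrations. For response 2, chance-corrected grader agreement is conventionally reported as Cohen's kappa; for response 3, the described outlier test amounts to a z-score of suspect-benchmark accuracy against the same model's accuracies on comparable uncontaminated contests. The sketch below uses illustrative numbers and hypothetical interfaces, not values or code from the paper:

```python
from collections import Counter
from statistics import mean, stdev

def cohen_kappa(grades_a: list[int], grades_b: list[int]) -> float:
    """Chance-corrected agreement between two independent proof graders."""
    n = len(grades_a)
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    ca, cb = Counter(grades_a), Counter(grades_b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

def contamination_zscore(suspect_acc: float, clean_accs: list[float]) -> float:
    """How far a model's accuracy on a suspect benchmark sits above its own
    accuracies on comparable post-cutoff contests; large z suggests memorization."""
    mu, sigma = mean(clean_accs), stdev(clean_accs)
    return (suspect_acc - mu) / sigma if sigma > 0 else float("inf")

# Illustrative only: two graders scoring six proofs on a 0-7 rubric, and a
# model scoring 0.93 on AIME 2024 against ~0.62 on difficulty-matched contests.
print(cohen_kappa([7, 3, 0, 5, 7, 1], [7, 2, 0, 5, 6, 1]))
print(contamination_zscore(0.93, [0.60, 0.65, 0.58, 0.66]))  # ~8, an outlier
```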

Circularity Check

0 steps flagged

No circularity: empirical benchmark scores on external contests with no derived predictions or self-referential reductions

full rationale

The paper presents an empirical benchmark (MathArena) that scores LLMs on newly released competition problems (AIME 2024, CMIMC 2025, IMO 2025, etc.). Central results are raw performance percentages across 162 problems and 50+ models. No equations, fitted parameters, or first-principles derivations exist that could reduce to inputs by construction. The key methodological claim (evaluating 'as soon as new problems are released' eliminates contamination) is an unverified assumption about external data, not a self-definitional or fitted prediction inside the paper. No self-citations are load-bearing for any quantitative result. The chain of results is therefore grounded in external benchmarks rather than in any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the domain assumption that contest problems released after a model's training cutoff are uncontaminated; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Newly released math competition problems have not appeared in any training data or web scrape used by the evaluated LLMs.
    This is the central premise stated in the abstract as the key insight enabling uncontaminated evaluation.

pith-pipeline@v0.9.0 · 5560 in / 1269 out tokens · 32411 ms · 2026-05-15T00:05:06.072717+00:00 · methodology

discussion (0)


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  2. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

    cs.AI 2026-04 accept novelty 8.0

    MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

  3. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  4. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

    cs.AI 2026-05 unverdicted novelty 7.0

    Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

  5. Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

  6. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  7. Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs

    cs.AI 2026-04 unverdicted novelty 7.0

    MEDS is a dataset of 28,000 LLM personas performing high-school math tasks alongside psychometric tests and cognitive networks that capture math anxiety, self-efficacy, and confidence to support safer AI tutors.

  8. Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

    cs.LG 2026-04 unverdicted novelty 7.0

    EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...

  9. Stateful Reasoning via Insight Replay

    cs.AI 2026-05 conditional novelty 6.0

    InsightReplay improves LLM accuracy on reasoning benchmarks by extracting and replaying critical insights to maintain their accessibility during extended chain-of-thought generation.

  10. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  11. DataMaster: Data-Centric Autonomous AI Research

    cs.LG 2026-05 unverdicted novelty 6.0

    DataMaster autonomously optimizes data via tree search and shared memory, raising medal rate 32.27% on MLE-Bench Lite and beating the base instruct model on GPQA.

  12. DataMaster: Data-Centric Autonomous AI Research

    cs.LG 2026-05 unverdicted novelty 6.0

    DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GP...

  13. Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

    cs.AI 2026-05 unverdicted novelty 6.0

    ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...

  14. An Interpretable and Scalable Framework for Evaluating Large Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.

  15. AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

    cs.CL 2026-04 unverdicted novelty 6.0

    AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

  16. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  17. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    cs.LG 2026-04 unverdicted novelty 6.0

    On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

  18. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  19. M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.

  20. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

  21. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.

  22. Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    MathArena is a maintained platform evaluating LLMs across olympiad problems, proofs, research questions, and formal proofs, with GPT-5.5 reaching 98% on 2026 USAMO and 74% on research-level tasks.

  23. Beyond Distribution Sharpening: The Importance of Task Rewards

    cs.LG 2026-04 unverdicted novelty 5.0

    Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.

  24. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  25. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 22 Pith papers · 7 internal anchors

  1. [1]

    Siavash Ameli, Siyuan Zhuang, Ion Stoica, and Michael W

    Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Y...

  2. [2]

    2025 AIME I

    Art of Problem Solving. 2025 AIME I. Art of Problem Solving Wiki, 2025. URL https://artofproblemsolving.com/wiki/index.php/2025_AIME_I. Accessed: 2025.

  3. [3]

    2025 AIME II

    Art of Problem Solving. 2025 AIME II. Art of Problem Solving Wiki, 2025. URL https://artofproblemsolving.com/wiki/index.php/2025_AIME_II. Accessed: 2025.

  4. [4]

    Brown University Math Olympiad 2025, 2025

    BRUMO. Brown University Math Olympiad 2025, 2025. URL https://www.brumo.org/. Accessed: 2025.

  5. [5]

    CMIMC 2025, 2025

    CMIMC. CMIMC 2025, 2025. URL https://cmimc.math.cmu.edu/. Accessed: 2025.

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  8. [8]

    MathConstruct: Challenging LLM Reasoning with Constructive Proofs

    Jasper Dekoninck, Mislav Balunović, Nikola Jovanović, Ivo Petrov, and Martin Vechev. MathConstruct: Challenging LLM reasoning with constructive proofs. In ICLR 2025 Workshop: VerifAI: AI Verification in the Wild.

  9. [10]

    OpenAI and FrontierMath. Epoch AI Blog, 2024

    Epoch. OpenAI and FrontierMath. Epoch AI Blog, 2024. URL https://epoch.ai/blog/openai-and-frontiermath.

  10. [11]

    Project Euler, 2025

    Project Euler. Project Euler, 2025. URL https://projecteuler.net/. Accessed: 2025.

  11. [12]

    International Mathematical Olympiad, 2025

    IMO Foundation. International Mathematical Olympiad, 2025. URL https://www.imo-official.org/. Accessed: 2025.

  12. [13]

    Mathematical Capabilities of ChatGPT

    Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of ChatGPT. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference o...

  13. [14]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models. CoRR, abs/2410.07...

  14. [15]

    FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv, 2024

    Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, 1...

  15. [16]

    Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In ACL (1), pages 3828–3850. Association for Computational Linguist...

  16. [17]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS Datasets and Benchmarks, 2021.

  17. [18]

    HMMT 2025, 2025

    HMMT. HMMT 2025, 2025. URL https://www.hmmt.org/. Accessed: 2025.

  18. [20]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  19. [21]

    FIMO: A challenge formal dataset for automated theorem proving. CoRR, abs/2309.04295, 2023

    Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. FIMO: A challenge formal dataset for automated theorem proving. CoRR, abs/2309.04295, 2023.

  20. [22]

    Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics

    Hamed Mahdavi, Alireza Hashemi, Majid Daliri, Pegah Mohammadipour, Alireza Farhadi, Samira Malek, Yekta Yazdanifard, Amir Khasahmadi, and Vasant Honavar. Brains vs. bytes: Evaluating LLM proficiency in olympiad mathematics. arXiv preprint arXiv:2504.01995, 2025.

  21. [23]

    Leveraging online olympiad-level math problems for LLMs training and contamination-resistant evaluation. CoRR, abs/2501.14275, 2025

    Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, and Renjie Liao. Leveraging online olympiad-level math problems for LLMs training and contamination-resistant evaluation. CoRR, abs/2501.14275, 2025.

  22. [24]

    GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/...

  23. [25]

    2025 Putnam Mathematical Competition, 2025

    Mathematical Association of America. 2025 Putnam Mathematical Competition, 2025. URL https://maa.org/putnam/. Accessed: 2025.

  24. [26]

    2025 USA Math Olympiad, 2025

    Art of Problem Solving. 2025 USA Math Olympiad, 2025. URL https://artofproblemsolving.com/wiki/index.php/2025_USAMO. Accessed: 2025.

  25. [27]

    Deep research, 2025

    OpenAI. Deep research, 2025. URL https://openai.com/index/introducing-deep-research/.

  26. [28]

    Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad. arXiv preprint arXiv:2503.21934, 2025

    Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunović, Nikola Jovanović, and Martin Vechev. Proof or bluff? Evaluating LLMs on 2025 USA Math Olympiad. arXiv preprint arXiv:2503.21934, 2025.

  27. [29]

    Humanity’s last exam. arXiv, 2025

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, et al. Humanity’s last exam. arXiv, 2025.

  28. [30]

    Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

    Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, Lei Fang, and Ji-Rong Wen. Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models. CoRR, abs/2503.21380, 2025.

  29. [31]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRR, abs/2507.06261, 2025. doi: 10.48550/ARXIV.2507.06261. URL https://doi.org/10.48550/arXiv.2507.06261.

  30. [32]

    PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition, 2024

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. PutnamBench: Evaluating neural theorem-provers on the Putnam Mathematical Competition. CoRR, abs/2407.11214, 2024.

  31. [33]

    LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, et al. LiveBench: A challenging, contamination-free LLM benchmark. arXiv preprint arXiv:2406.19314, 2024.

  32. [34]

    Grok 3 beta — the age of reasoning agents, February 2025

    xAI Team. Grok 3 beta — the age of reasoning agents, February 2025. URL https://x.ai/news/grok-3. News post.

  33. [35]

    Lean workbook: A large-scale Lean problem set formalized from natural language math problems. arXiv preprint arXiv:2406.03847, 2024

    Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, and Kai Chen. Lean workbook: A large-scale Lean problem set formalized from natural language math problems. arXiv preprint arXiv:2406.03847, 2024.

  34. [36]

    HARP: A Challenging Human-Annotated Math Reasoning Benchmark

    Albert S. Yue, Lovish Madaan, Ted Moskovitz, DJ Strouse, and Aaditya K. Singh. HARP: A challenging human-annotated math reasoning benchmark. CoRR, abs/2412.08819, 2024.

  35. [37]

    A careful examination of large language model performance on grade school arithmetic. Advances in Neural Information Processing Systems, 37:46819–46836, 2024

    Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, William Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, et al. A careful examination of large language model performance on grade school arithmetic. Advances in Neural Information Processing Systems, 37:46819–46836, 2024.

  36. [38]

    Exploring the compositional deficiency of large language models in mathematical reasoning. arXiv preprint arXiv:2405.06680, 2024

    Jun Zhao, Jingqi Tong, Yurong Mou, Ming Zhang, Qi Zhang, and Xuanjing Huang. Exploring the compositional deficiency of large language models in mathematical reasoning. arXiv preprint arXiv:2405.06680, 2024.

  37. [39]

    miniF2F: a cross-system benchmark for formal olympiad-level mathematics

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. miniF2F: a cross-system benchmark for formal olympiad-level mathematics. In ICLR. OpenReview.net, 2022.

  38. [40]

    DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=V...
