FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Pith reviewed 2026-05-17 00:37 UTC · model grok-4.3
The pith
FrontierMath shows that current AI models solve under 2% of hundreds of original expert-level mathematics problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems.
What carries the argument
The FrontierMath benchmark of original unpublished problems paired with automated verification to test advanced mathematical reasoning.
If this is right
- Supplies a contamination-resistant testbed for measuring AI progress toward expert-level mathematical abilities.
- Shows that present models remain far below the performance of human mathematicians on problems across number theory, analysis, algebraic geometry, and category theory.
- Allows consistent tracking of improvements as AI systems develop better reasoning methods.
- Sets evaluation standards that match the multi-hour or multi-day effort typical for human experts.
Where Pith is reading between the lines
- Persistent low scores may suggest that current AI training approaches require new components to reach expert mathematical performance.
- The benchmark could support comparisons with other scientific reasoning tasks to identify where math presents unique difficulties.
- High performance on FrontierMath problems might eventually link to an AI system's capacity for producing original mathematical results.
Load-bearing premise
The problems are genuinely original and unpublished with no data contamination risk, and automated verification reliably measures true mathematical reasoning ability.
What would settle it
A current leading AI model achieving success on more than 10 percent of the FrontierMath problems without prior exposure to them would challenge the reported performance gap.
read the original abstract
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FrontierMath, a benchmark of hundreds of original, expert-vetted mathematics problems spanning computational areas like number theory and real analysis as well as abstract topics in algebraic geometry and category theory. Problems are designed to require hours to days of expert effort, and the manuscript reports that current state-of-the-art AI models solve under 2% of them using automated verification on unpublished problems to reduce contamination risk.
Significance. If the evaluation methodology proves robust, FrontierMath would provide a valuable, high-bar testbed for tracking progress toward expert-level mathematical reasoning in AI systems. The emphasis on original problems and broad coverage of modern mathematics branches is a positive feature that could help quantify the claimed gap between AI performance and human expertise.
Major comments (2)
- [Abstract and verification description] The automated verification procedure for abstract problems is under-specified. The abstract states that problems include 'abstract questions in algebraic geometry and category theory' and relies on 'automated verification,' yet no details are given on answer formats, checker implementation, handling of multi-step proofs or constructions, or acceptance criteria for equivalent but non-canonical solutions. This directly affects the reliability of the <2% solve-rate claim.
- [Benchmark design and problem selection] Details on problem curation and vetting for originality are insufficient. The central claim depends on the problems being genuinely new and unpublished to avoid data contamination, but the manuscript provides limited information on the expert review process or safeguards against prior publication.
Minor comments (2)
- [Abstract] The exact number of problems and their distribution across subfields should be reported more precisely rather than as 'hundreds' to allow better assessment of statistical power.
- [Introduction] Consider adding a table or section comparing FrontierMath to existing benchmarks (e.g., MATH, GSM8K) in terms of difficulty and verification approach for context.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential value of FrontierMath as a high-bar benchmark. We address each major comment below, providing clarifications and committing to specific revisions that will strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [Abstract and verification description] The automated verification procedure for abstract problems is under-specified. The abstract states that problems include 'abstract questions in algebraic geometry and category theory' and relies on 'automated verification,' yet no details are given on answer formats, checker implementation, handling of multi-step proofs or constructions, or acceptance criteria for equivalent but non-canonical solutions. This directly affects the reliability of the <2% solve-rate claim.
Authors: We agree that the manuscript's description of automated verification is high-level and would benefit from greater specificity, particularly for abstract problems. In the revised version we will add a dedicated subsection on verification procedures. This will specify answer formats (e.g., explicit algebraic objects, invariants, or canonical representatives), the use of computer-algebra and formal-verification libraries for checking equivalence or isomorphism, and the fact that problems are constructed so that a final verifiable output (rather than a full multi-step proof) can be checked automatically. We will also clarify acceptance criteria for non-canonical but equivalent solutions. These additions will directly support the reliability of the reported solve rates. revision: yes
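The verification scheme the authors describe, checking a single machine-readable final answer by value rather than grading a multi-step proof, can be sketched minimally as follows. This is an illustrative sketch only: the function name and the exact-rational answer format are hypothetical, and a real checker for abstract problems would rely on computer-algebra or formal-verification libraries as the rebuttal indicates.

```python
from fractions import Fraction

def verify_answer(submitted: str, reference: str) -> bool:
    """Hypothetical FrontierMath-style check of a final answer.

    Compares exact rational values, so equivalent but non-canonical
    forms (e.g. "6/4" vs. "3/2") are accepted; comparison is by value,
    not by string. Malformed submissions count as incorrect.
    """
    try:
        return Fraction(submitted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        return False

# Non-canonical but equivalent forms are accepted:
print(verify_answer("6/4", "3/2"))        # True
print(verify_answer("0.25", "1/4"))       # True
print(verify_answer("not a number", "2")) # False
```

The design choice this illustrates is the one the rebuttal commits to documenting: problems are constructed so that a single verifiable output can be checked automatically, which sidesteps grading free-form proofs but constrains what kinds of questions the benchmark can pose.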
-
Referee: [Benchmark design and problem selection] Details on problem curation and vetting for originality are insufficient. The central claim depends on the problems being genuinely new and unpublished to avoid data contamination, but the manuscript provides limited information on the expert review process or safeguards against prior publication.
Authors: The referee correctly notes that the current text gives only a brief account of curation. We will expand the relevant section to describe the process in more detail: problems were proposed by domain experts, reviewed internally for correctness and difficulty, and checked for novelty against recent literature and standard databases. All problems were developed specifically for this benchmark and have not been previously published. To preserve the benchmark's utility we will not release the full problem set at this stage, but the added description of the vetting workflow and contamination safeguards will better substantiate our claims. revision: yes
Circularity Check
No circularity: benchmark introduction with direct empirical evaluation
full rationale
The paper introduces FrontierMath as a new collection of original, unpublished mathematics problems and reports an empirical result that current SOTA models solve under 2% of them. No equations, fitted parameters, or derivations are present. The central claim rests on direct model evaluation against the benchmark rather than any self-referential definition, renamed known result, or load-bearing self-citation chain. The automated verification procedure is described at a high level in the abstract but does not reduce the performance statistic to an input by construction. This is a standard benchmark paper whose claims are falsifiable by external replication and therefore self-contained.
Axiom & Free-Parameter Ledger
No entries: the paper presents no equations, fitted parameters, or derivations.
Forward citations
Cited by 17 Pith papers
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
-
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
-
MathDuels: Evaluating LLMs as Problem Posers and Solvers
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
DeonticBench: A Benchmark for Reasoning over Rules
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
-
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
-
Agentic Frameworks for Reasoning Tasks: An Empirical Study
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
-
Automated Conjecture Resolution with Formal Verification
An AI framework combining informal reasoning and formal verification resolves an open commutative algebra problem and produces a Lean 4-checked proof with minimal human input.
-
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
XpertBench provides 1,346 rubric-scored expert tasks showing leading LLMs achieve a maximum ~66% success rate and ~55% mean score across domains.
-
Riemann-Bench: A Benchmark for Moonshot Mathematics
Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.
-
Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
Artificial Intelligence and the Structure of Mathematics
AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.
-
AI for Mathematics: Progress, Challenges, and Prospects
AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.
Reference graph
Works this paper leans on
- [1] MSC2020 Mathematics Subject Classification System
- [2] Training verifiers to solve math word problems (2021)
- [3] Mathematical capabilities of ChatGPT. Advances in Neural Information Processing Systems
- [4] Measuring mathematical problem solving with the MATH dataset
- [5] Math Olympiad Hardness Scale (MOHS)
- [6] MOHS was a mistake
- [7] Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
- [8] Are We Done with MMLU?
- [9] Mathematical discoveries from program search with large language models. Nature
- [10] MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
- [11] OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems
- [12] ARB: Advanced Reasoning Benchmark for Large Language Models
- [13] PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
- [14] Solving olympiad geometry without human demonstrations. Nature
- [15] Investigating data contamination in modern benchmarks for large language models
- [16] Benchmark Data Contamination of Large Language Models: A Survey
- [17] Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. arXiv:2103.14749
- [18] Learning to Reason with LLMs
- [19] OpenAI o1-mini
- [20] Introducing OpenAI o1-preview
- [21] Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
- [22] Grok-2 Beta Release
- [23] Release notes
- [24] Claude 3.5 Sonnet Model Card Addendum
- [25] MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv:2310.02255
- [26] Measuring Massive Multitask Language Understanding. arXiv:2009.03300
- [27] Missing undergraduate mathematics in mathlib
- [28] Theory of Groups of Finite Order
- [29] Equality of orders of a set of integers modulo a prime. arXiv:1912.02554
- [30] Garcia, Arnaldo. Curves over Finite Fields Attaining the Hasse-Weil Upper Bound. European Congress of Mathematics (2001)
- [31] The Weil Conjectures for Curves
- [32] Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno de Moraes Dumont, and Sanmi Koyejo. Putnam- (2024)