pith. machine review for the scientific record.

arXiv: 2109.00110 · v2 · submitted 2021-08-31 · 💻 cs.AI · cs.FL · cs.LG

Recognition: 1 theorem link

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 19:59 UTC · model grok-4.3

classification 💻 cs.AI · cs.FL · cs.LG
keywords neural theorem proving · formal mathematics · benchmark dataset · Olympiad problems · Metamath · Lean · Isabelle
0 comments

The pith

MiniF2F supplies 488 formal statements from math contests as a shared benchmark for neural theorem provers across systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents miniF2F, a dataset of formal statements drawn from the AIME, AMC, and IMO, as well as high-school and undergraduate mathematics courses. It targets Metamath, Lean, Isabelle, and HOL Light to enable direct comparison of neural provers across systems. Baseline results are reported using GPT-f, a GPT-3-based prover, and the dataset is offered as a community resource for measuring progress.

Core claim

miniF2F consists of 488 problem statements translated into the formal languages of several proof systems, providing a unified cross-system benchmark for neural theorem proving on Olympiad-level mathematics.
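For concreteness, the appendix example reproduced in the reference graph at the bottom of this page (Problem 11 of 2000 AMC 12) corresponds to a Lean entry roughly like the sketch below. Only the statement is the benchmark item; a prover must replace sorry with a proof the system accepts. Hypothesis names and formatting are inferred from that fragment and may differ from the released files.

    -- Sketch of a miniF2F-style Lean (mathlib) statement; names are inferred from the
    -- appendix fragment quoted in the reference graph, not copied from the dataset.
    import data.real.basic

    theorem amc12_2000_p11
      (a b : ℝ)
      (h₀ : a ≠ 0 ∧ b ≠ 0)        -- two non-zero real numbers
      (h₁ : a * b = a - b) :       -- the relation given in the contest problem
      a / b + b / a - a * b = 2 :=
    sorry  -- the statement alone is the benchmark entry; an attached proof is optional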

What carries the argument

The miniF2F dataset of 488 translated formal problem statements that standardizes evaluation across different interactive theorem proving systems.

Load-bearing premise

The chosen contest problems translate accurately into correct formal statements that form a stable and representative measure of progress.

What would settle it

If independent formalizations of the same problems yield inconsistent solvability rates for the same neural prover, or if high performance on miniF2F fails to predict results on new contest problems.
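A concrete way to run the first test is an agreement check: evaluate the same prover on two independently produced formalizations of the same problems and compare per-problem outcomes. The sketch below is illustrative only; the result files and their format (problem ID mapped to pass/fail) are hypothetical, not artifacts of the paper.

    # Hypothetical cross-formalization agreement check (not from the paper).
    # Each results file maps a problem ID, e.g. "amc12_2000_p11", to True/False
    # depending on whether the prover closed that formalization of the problem.
    import json

    def load_results(path):
        with open(path) as f:
            return json.load(f)

    def agreement_report(a, b):
        shared = sorted(a.keys() & b.keys())
        rate_a = sum(a[p] for p in shared) / len(shared)
        rate_b = sum(b[p] for p in shared) / len(shared)
        disagree = [p for p in shared if a[p] != b[p]]
        print(f"shared problems: {len(shared)}")
        print(f"solve rate on formalization A: {rate_a:.1%}")
        print(f"solve rate on formalization B: {rate_b:.1%}")
        print(f"problems with inconsistent outcomes: {len(disagree)}")

    # agreement_report(load_results("results_A.json"), load_results("results_B.json"))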

read the original abstract

We present miniF2F, a dataset of formal Olympiad-level mathematics problems statements intended to provide a unified cross-system benchmark for neural theorem proving. The miniF2F benchmark currently targets Metamath, Lean, Isabelle (partially) and HOL Light (partially) and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO), as well as material from high-school and undergraduate mathematics courses. We report baseline results using GPT-f, a neural theorem prover based on GPT-3 and provide an analysis of its performance. We intend for miniF2F to be a community-driven effort and hope that our benchmark will help spur advances in neural theorem proving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces miniF2F, a benchmark dataset of 488 formal statements of Olympiad-level mathematics problems (drawn from AIME, AMC, IMO, and high-school/undergraduate sources) formalized in Metamath, Lean, Isabelle (partial), and HOL Light (partial). It reports baseline performance using the GPT-f neural theorem prover and positions the resource as a community-driven, cross-system evaluation tool for neural theorem proving.

Significance. If the formalizations are verifiably correct and semantically equivalent to the source problems, miniF2F would supply a much-needed standardized, challenging benchmark that enables direct comparison of neural provers across proof assistants. The provision of baselines and the explicit invitation for community extensions are practical strengths that could accelerate progress in the area.

major comments (1)
  1. [Dataset construction] Dataset construction section: the manuscript provides no description of a validation procedure, cross-system consistency checks, or error audit for the 488 formal statements. Without reported verification against the original English contest problems, it is impossible to confirm that the statements are accurate translations rather than altered problems, which directly undermines the claim of a reliable unified benchmark.
minor comments (2)
  1. [Introduction / Dataset] The partial coverage of Isabelle and HOL Light should be quantified (e.g., exact number of statements per system) and any systematic differences in formalization style across systems should be noted.
  2. [Experiments] Baseline results would benefit from a clearer breakdown of success rates by problem source (AIME vs. IMO) and by system, rather than aggregate figures only.
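The requested breakdown is straightforward once each attempt is logged with its problem source and target system. The record format below is hypothetical and only sketches the computation; it is not drawn from the paper.

    # Hypothetical per-source / per-system success-rate breakdown (not from the paper).
    from collections import defaultdict

    records = [
        {"source": "AMC",  "system": "Lean",     "solved": True},
        {"source": "IMO",  "system": "Lean",     "solved": False},
        {"source": "AIME", "system": "Metamath", "solved": True},
        # ... one record per (problem, system) attempt
    ]

    def success_rates(records, key):
        total, solved = defaultdict(int), defaultdict(int)
        for r in records:
            total[r[key]] += 1
            solved[r[key]] += r["solved"]
        return {k: solved[k] / total[k] for k in total}

    print("by source:", success_rates(records, "source"))
    print("by system:", success_rates(records, "system"))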

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for their constructive feedback and recommendation for minor revision. We address the major comment below.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the manuscript provides no description of a validation procedure, cross-system consistency checks, or error audit for the 488 formal statements. Without reported verification against the original English contest problems, it is impossible to confirm that the statements are accurate translations rather than altered problems, which directly undermines the claim of a reliable unified benchmark.

    Authors: We agree with the referee that the current manuscript does not provide a detailed account of the validation procedure for the formal statements. The statements were formalized by the authors with domain expertise in each proof assistant, and each was manually verified for semantic equivalence to the original English problems. However, to enhance clarity and address this concern, we will revise the Dataset construction section to include a comprehensive description of the validation steps, cross-system consistency checks, and error auditing process. This will include specifics on how the formalizations were reviewed and confirmed.

    Revision: yes

Circularity Check

0 steps flagged

Dataset release with empirical baselines exhibits no circularity

full rationale

The paper introduces miniF2F as a new benchmark dataset of 488 formal problem statements drawn from external contest sources and reports direct empirical baselines using the existing GPT-f prover. No derivations, fitted parameters, predictions, or uniqueness theorems are claimed. The central contribution is the dataset itself, which is externally verifiable against the original AIME/AMC/IMO problems, and the reported results are straightforward performance measurements without any self-referential reduction to inputs or self-citation chains. This is a standard dataset paper whose claims rest on the external correctness of the translations rather than any internal circular construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the assumption that contest problems can be faithfully formalized and that the resulting statements constitute a meaningful test of neural theorem provers; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Selected AIME, AMC, and IMO problems can be accurately translated into correct formal statements in Metamath, Lean, Isabelle, and HOL Light.
    The benchmark value depends on the fidelity of these translations.

pith-pipeline@v0.9.0 · 5423 in / 1261 out tokens · 48223 ms · 2026-05-16T19:59:49.058006+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathAtlas: A Benchmark for Autoformalization in the Wild

    cs.AI 2026-05 accept novelty 8.0

    MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

  2. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  3. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

    cs.AI 2026-05 unverdicted novelty 7.0

    Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

  4. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

  5. Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

    cs.AI 2026-05 unverdicted novelty 7.0

    Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling...

  6. Faithful Autoformalization via Roundtrip Verification and Repair

    cs.CL 2026-04 unverdicted novelty 7.0

    Roundtrip verification plus diagnosis-guided repair boosts LLM formalization equivalence from 45-61% to 83-85% on 150 traffic rules without ground-truth labels.

  7. Faithful Autoformalization via Roundtrip Verification and Repair

    cs.CL 2026-04 unverdicted novelty 7.0

    Roundtrip verification with diagnosis-guided repair improves faithful autoformalization of statutory text by LLMs, where failing equivalence checks correlate with 1.4x-2.5x higher NLI drift than passing ones.

  8. Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

  9. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    cs.CL 2024-10 conditional novelty 7.0

    Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

  10. Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

    cs.AI 2026-05 unverdicted novelty 6.0

    Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

  11. ProofSketcher: Hybrid LLM + Lightweight Proof Checker for Reliable Math/Logic Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    A hybrid pipeline lets an LLM write high-level proof sketches in a compact DSL that a lightweight kernel then expands into explicit, checkable obligations for reliable math and logic reasoning.

  12. A Minimal Agent for Automated Theorem Proving

    cs.AI 2026-02 unverdicted novelty 6.0

    A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

  13. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  14. OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving

    cs.LG 2026-04 unverdicted novelty 5.0

    OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.

  15. pAI/MSc: ML Theory Research with Humans on the Loop

    cs.AI 2026-04 unverdicted novelty 5.0

    pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...

  16. Learning to Reason with Insight for Informal Theorem Proving

    cs.AI 2026-04 unverdicted novelty 5.0

    A new dataset structuring proofs by core techniques plus progressive multi-stage fine-tuning lets LLMs outperform baselines on informal theorem-proving benchmarks.

  17. Rethinking Wireless Communications through Formal Mathematical AI Reasoning

    eess.SP 2026-04 unverdicted novelty 4.0

    Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.

  18. AI for Mathematics: Progress, Challenges, and Prospects

    math.HO 2026-01 unverdicted novelty 4.0

    AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 17 Pith papers · 2 internal anchors

  1. [1]

    Holist: An environment for machine learning of higher order logic theorem proving

    Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. Holist: An environment for machine learning of higher order logic theorem proving. In International Conference on Machine Learning, pp. 454–463. PMLR, 2019a. Kshitij Bansal, Christian Szegedy, Markus N Rabe, Sarah M Loos, and Viktor Toman. Learning to reason in large theories...

  2. [2]

    Lean perfectoid spaces

    URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. Kevin Buzzard, Johan Commelin, and Patrick Massot. Lean perfectoid spaces. https://leanprover-community.github.io/lean-perfectoid-spaces/,

  3. [3]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE,

  4. [4]

    Proof artifact co-training for theorem proving with language models

    Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. arXiv preprint arXiv:2102.06203,

  5. [5]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

  6. [6]

    The lean mathematical library

    The mathlib Community. The lean mathematical library. In Jasmin Blanchette and Catalin Hritcu (eds.), Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, New Orleans, LA, USA, January 20-21, 2020 , pp. 367–381. ACM,

  7. [7]

    Graph representations for higher-order logic and theorem proving

    doi: 10.1145/3372885.3373824. URL https://doi.org/10.1145/3372885.3373824. Aditya Paliwal, Sarah Loos, Markus Rabe, Kshitij Bansal, and Christian Szegedy. Graph representations for higher-order logic and theorem proving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 2967–2974,

  8. [8]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Be...

  9. [9]

    Generative language modeling for automated theorem proving

    doi: 10.18653/v1/p16-1144. URL https://doi.org/10.18653/v1/p16-1144. Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393,

  10. [10]

    Squad: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 2383–2392. The Association for C...

  11. [11]

    Analysing mathematical reasoning abilities of neural models

    doi: 10.18653/v1/d16-1264. URL https://doi.org/10.18653/v1/d16-1264. David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net,

  12. [12]

    Liquid tensor experiment

    URL https://openreview.net/forum?id=H1gR5iR5FX. Peter Scholze. Liquid tensor experiment. https://xenaproject.wordpress.com/2020/12/05/liquid-tensor-experiment/,

  13. [13]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 . OpenReview.net,

  14. [14]

    Naturalproofs: Mathematical theorem proving in natural language

    URL https://openreview.net/forum?id=rJ4km2R5t7. Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. Naturalproofs: Mathematical theorem proving in natural language. arXiv preprint arXiv:2104.01112,

  15. [16]

    Holophrasm: a neural Automated Theorem Prover for higher-order logic

    URL http://arxiv.org/abs/1608.02644. Freek Wiedijk. Formalizing 100 theorems. https://www.cs.ru.nl/~freek/100/,

  16. [17]

    INT: an inequality benchmark for evaluating generalization in theorem proving

    Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Baker Grosse. INT: an inequality benchmark for evaluating generalization in theorem proving. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,

  17. [18]

    The proof is optionally attached thus not part of the benchmark

    Table 4 (Appendix A, example of statement in miniF2F): Problem 11 of 2000 AMC 12 is formalized with proof in different languages in miniF2F. The proof is optionally attached thus not part of the benchmark. The proof in Metamath is too long to be fully displayed. Natural Language: Two non-zero real numbers, a and b, sati...

  18. [19]

    a \<noteq> 0

    Lean (statement fragment): (h₁ : a * b = a - b) : a / b + b / a - a * b = 2 := begin field_simp [h₀.1, h₀.2], simp only [h₁, mul_comm, mul_sub], ring, end
    Isabelle: theorem amc12_2000_p11: fixes a b :: real assumes "a \<noteq> 0" "b \<noteq> 0" and "a * b = a - b" shows "a / b + b / a - a * b = 2" using assms by (smt (verit, ccfv_threshold) diff_divide_distrib div_self divide_divid...