pith. sign in

arxiv: 2605.00674 · v2 · pith:KUFHIHH4new · submitted 2026-05-01 · 💻 cs.CL

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Pith reviewed 2026-05-19 18:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluationmathematical reasoningdynamic benchmarksolympiad problemsresearch-level mathformal proofsmodel progress tracking
0
0 comments X

The pith

Expanding MathArena to proofs and research questions shows frontier LLMs solve 98% of 2026 USAMO problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Static benchmarks for LLM math performance saturate quickly and become unreliable for tracking real progress. The paper establishes MathArena as a continuously maintained platform that broadens evaluation to proof-based competitions, arXiv research problems, and Lean formal proofs while introducing fresh benchmarks as models advance. A consistent protocol allows fair comparisons across models. Results indicate the strongest model reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions. This setup matters because it gives a clearer, up-to-date view of whether LLMs are developing genuine mathematical reasoning.

Core claim

MathArena has been extended from final-answer olympiad problems to a broad, maintained platform covering proof-based competitions, research-level arXiv problems, and formal proof generation in Lean, with a fixed evaluation protocol and regular addition of new benchmarks to match improving capabilities, under which the strongest model GPT-5.5 reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions.

What carries the argument

MathArena, a continuously updated evaluation platform that aggregates results across olympiad, proof, research, and formal tasks while adding new problems to stay challenging.

If this is right

  • Models can be compared reliably across a wider variety of mathematical tasks including proofs and research questions.
  • Progress in LLM mathematical abilities can be tracked over time through regular benchmark updates.
  • Evaluation stays relevant as capabilities grow because new problems are introduced when old ones saturate.
  • High accuracy on research-level questions indicates LLMs may soon assist with advanced mathematical work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar maintained platforms could be built for physics or computer science to monitor AI capabilities in those areas.
  • The focus on new benchmarks suggests a community need for private or freshly created problems to avoid contamination.
  • If performance trends continue, LLMs might contribute to solving open research mathematics problems within a few years.

Load-bearing premise

The benchmarks and protocol remain free of training-data contamination and measure genuine reasoning rather than memorization or shortcuts.

What would settle it

A model achieving high scores on MathArena but low scores on a fresh set of never-published math problems of similar difficulty would indicate contamination or overfitting instead of reasoning ability.

Figures

Figures reproduced from arXiv: 2605.00674 by Chenhao Sun, Ivo Petrov, Jasper Dekoninck, K\'ari R\"ognvaldsson, Martin Vechev, Nikola Jovanovi\'c, Tim Gehrunger.

Figure 1
Figure 1. Figure 1: Over the last year, MATHARENA has evolved through repeated benchmark launches (green), human grading (orange), interface improvements (blue), blog posts (purple), and new model evaluations (black). This lifecycle is what distinguishes an evaluation platform from a benchmark. MATHARENA: a platform for mathematical reasoning. The MATHARENA benchmark [9] introduced uncontaminated mathematics competitions as a… view at source ↗
Figure 2
Figure 2. Figure 2: Performance over time. Key takeaways. As shown in view at source ↗
Figure 3
Figure 3. Figure 3: Performance versus cost for all models with sufficient website coverage. Details on how view at source ↗
Figure 4
Figure 4. Figure 4: Absolute score gaps between the final human-validated USAMO 2026 grades and each automated grading setting. Automated proof evaluation is accurate. In view at source ↗
Figure 5
Figure 5. Figure 5: Leave-one-family-out rank robustness for the top 10 models under the full expected view at source ↗
Figure 6
Figure 6. Figure 6: Empirical coverage versus nominal level. Dashed is ideal calibration. 6K 10K 20K 50K 100K Average No. Tokens 15.0% 20.0% 25.0% 30.0% 35.0% Accuracy Agents DeepSeek Math Nomos RSA SelfCheck view at source ↗
Figure 8
Figure 8. Figure 8: The homepage presents the live leaderboard together with benchmark tabs, filters, and view at source ↗
Figure 9
Figure 9. Figure 9: The benchmark index page links each competition to scores, datasets, outputs, and short view at source ↗
Figure 10
Figure 10. Figure 10: The detailed leaderboard adds confidence intervals, token counts, retries, runtime, and view at source ↗
Figure 11
Figure 11. Figure 11: Each model page aggregates benchmark performance, expected rank, pricing, and view at source ↗
Figure 12
Figure 12. Figure 12: The comparison view places two models side by side to highlight differences in accuracy view at source ↗
Figure 13
Figure 13. Figure 13: Problem pages show the prompt, the verified answer, run-by-run outputs, and the full view at source ↗
Figure 14
Figure 14. Figure 14: The surprising-traces view surfaces high-loss failures that deviate strongly from a model’s view at source ↗
Figure 15
Figure 15. Figure 15: The blog page collects benchmark launches and qualitative analyses to document notable view at source ↗
read the original abstract

Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MathArena as an expanded, continuously maintained evaluation platform for LLM mathematical reasoning. It broadens the original olympiad-focused benchmark to encompass proof-based competitions, research-level arXiv problems, and formal proof generation in Lean, while maintaining a fixed evaluation protocol and introducing new benchmarks as capabilities advance. The central empirical claim is that GPT-5.5 reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions.

Significance. If the reported performance numbers rest on genuinely novel, uncontaminated problems that measure reasoning rather than memorization, the work would provide valuable evidence of rapid progress in frontier models' mathematical capabilities and would strengthen the case for dynamic, maintained evaluation platforms over static benchmarks.

major comments (2)
  1. [Abstract / Evaluation Protocol] Abstract and the section describing the evaluation protocol: the headline results (GPT-5.5 at 98% on 2026 USAMO and 74% on research-level arXiv problems) are presented as evidence that frontier models now solve extremely challenging problems, yet no concrete description is given of problem sourcing, leakage detection against common pre-training corpora, or statistical controls for benchmark-specific shortcuts; without these the numbers cannot be interpreted as genuine reasoning gains.
  2. [Benchmark Maintenance] The subsection on benchmark maintenance and new benchmark design: the claim that benchmarks are regularly updated 'as model capabilities improve' is central to the platform's value, but the manuscript supplies no explicit criteria, sourcing pipeline, or verification steps that would prevent post-hoc selection or distributional overlap with training data.
minor comments (1)
  1. [Results] Tables or figures reporting per-model, per-task accuracies would benefit from explicit error bars or sample-size information to allow readers to assess the stability of the 98% and 74% figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments correctly identify areas where the manuscript would benefit from greater transparency on problem sourcing, contamination controls, and benchmark maintenance procedures. We have revised the manuscript to incorporate these details and address the concerns point by point below.

read point-by-point responses
  1. Referee: [Abstract / Evaluation Protocol] Abstract and the section describing the evaluation protocol: the headline results (GPT-5.5 at 98% on 2026 USAMO and 74% on research-level arXiv problems) are presented as evidence that frontier models now solve extremely challenging problems, yet no concrete description is given of problem sourcing, leakage detection against common pre-training corpora, or statistical controls for benchmark-specific shortcuts; without these the numbers cannot be interpreted as genuine reasoning gains.

    Authors: We agree that the original manuscript provided insufficient detail on these aspects, which limits interpretability of the headline results. In the revised version we have expanded the evaluation protocol section with: (1) sourcing information stating that 2026 USAMO problems come from official post-2025 contest releases and research problems are drawn exclusively from arXiv preprints dated after the training cutoffs of all evaluated models; (2) a description of our leakage detection pipeline, which combines n-gram overlap checks against known public pre-training corpora with embedding-based semantic similarity searches; and (3) statistical controls including performance on a set of synthetically generated control problems designed to detect shortcut reliance. These additions allow readers to assess the degree to which results reflect reasoning rather than memorization. revision: yes

  2. Referee: [Benchmark Maintenance] The subsection on benchmark maintenance and new benchmark design: the claim that benchmarks are regularly updated 'as model capabilities improve' is central to the platform's value, but the manuscript supplies no explicit criteria, sourcing pipeline, or verification steps that would prevent post-hoc selection or distributional overlap with training data.

    Authors: We acknowledge the manuscript originally lacked explicit criteria and verification steps for benchmark maintenance. The revised subsection now specifies: new benchmarks are sourced only from competitions and arXiv papers released after the most recent model training cutoffs; each problem undergoes expert review for novelty and rigor; and distributional overlap is checked via cosine similarity on sentence embeddings against both training corpora and prior MathArena items. The sourcing pipeline is described as a combination of automated retrieval from official archives followed by manual curation to avoid post-hoc selection. These changes directly address the risk of contamination and selective reporting. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical platform description with direct evaluations

full rationale

The paper presents MathArena as a continuously maintained evaluation platform that broadens an existing benchmark to include olympiad problems, research-level arXiv questions, and Lean proofs, while reporting direct model performance numbers such as GPT-5.5 reaching 98% on 2026 USAMO and 74% on research questions. No derivations, equations, fitted parameters, or self-referential reductions appear in the provided text; the results are presented as outcomes of running models on the described tasks under a maintained protocol. The work is therefore self-contained with no load-bearing steps that collapse to author-defined inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The platform's value rests on the domain assumption that standardized protocols can fairly compare models across diverse math tasks and that newly added benchmarks remain uncontaminated by training data.

axioms (1)
  • domain assumption Standardized evaluation protocols can produce comparable scores across different LLMs on mathematical tasks.
    Invoked when claiming that GPT-5.5 and other models can be ranked reliably on the expanded set of benchmarks.

pith-pipeline@v0.9.0 · 5781 in / 1337 out tokens · 85426 ms · 2026-05-19T18:14:31.031764+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

    stat.ML 2026-05 unverdicted novelty 7.0

    CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.

  2. When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME...

  3. You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

    cs.LG 2026-05 unverdicted novelty 6.0

    RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.

  4. Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

    cs.AI 2026-05 unverdicted novelty 6.0

    Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

  5. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

    cs.LG 2026-05 unverdicted novelty 6.0

    Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

  6. Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...

  7. Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.

  8. BFLA: Block-Filtered Long-Context Attention Mechanism

    eess.SP 2026-05 unverdicted novelty 4.0

    BFLA is a two-stage block-filtered sparse prefill attention mechanism that constructs an input-dependent block mask and applies tile-level rescues to skip unimportant KV tiles while preserving exact attention inside r...

  9. A Survey of Audio Reasoning in Multimodal Foundation Models

    eess.AS 2026-05 unverdicted novelty 2.0

    A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · cited by 9 Pith papers · 5 internal anchors

  1. [6]

    Jordan, Joseph E

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett,...

  2. [7]

    Ai model & api providers analysis

    Artificial Analysis. Ai model & api providers analysis. https://artificialanalysis.ai/,

  3. [8]

    Accessed: 2026-04-13

  4. [9]

    Benchmarks.https://www.vals.ai/benchmarks, 2026

    Vals AI. Benchmarks.https://www.vals.ai/benchmarks, 2026. Accessed: 2026-04-13

  5. [10]

    Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovic, and Martin T. Vechev. Matharena: Evaluating llms on uncontaminated math competitions.CoRR, abs/2505.23281,

  6. [12]

    The lean theorem prover (system description)

    Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The lean theorem prover (system description). In Amy P. Felty and Aart Middeldorp, editors,Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings, Lecture Notes in Computer Science, p...

  7. [13]

    Gpt-5.5 system card

    OpenAI. Gpt-5.5 system card. Technical report, OpenAI, April 2026. URL https://openai. com/index/gpt-5-5-system-card/. Accessed: 2026-04-27

  8. [14]

    Deepseek-v4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. Technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/ blob/main/DeepSeek_V4.pdf. Released April 24, 2026. Accessed: 2026-04-29

  9. [15]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021. URLhttps://arxiv.org/abs/2110.14168

  10. [16]

    Measuring mathematical problem solv- ing with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solv- ing with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2...

  11. [17]

    URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html

  12. [18]

    Omni-math: A universal olympiad level mathematic benchmark for large language models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Cheng- hao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shang- haoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models. InThe Thirtee...

  13. [19]

    Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceed...

  14. [20]

    doi: 10.18653/v1/2024.acl-long

    Association for Computational Linguistics, 2024. doi: 10.18653/V1/2024.ACL-LONG

  15. [21]

    URLhttps://doi.org/10.18653/v1/2024.acl-long.211

  16. [22]

    Ugmathbench: A diverse and dynamic benchmark for undergraduate-level mathematical reasoning with large language models

    Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, and Can Yang. Ugmathbench: A diverse and dynamic benchmark for undergraduate-level mathematical reasoning with large language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/fo...

  17. [23]

    Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P

    Jingxuan Fan, Sarah Martinson, Erik Y . Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P. Brenner. Hardmath: A benchmark dataset for challenging problems in applied mathematics. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. UR...

  18. [30]

    Mathematical capabilities of chatgpt

    Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...

  19. [32]

    Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V . Le, and Junehyuk Jung. Towards robust mathematical reasoning. In Christos Christodoulopou...

  20. [37]

    Chawla, and Xiangliang Zhang

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V . Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,

  21. [38]

    URLhttps://openreview.net/forum?id=3GTtZFiajM

  22. [42]

    Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jen- nings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, edi- tors,Advances in ...

  23. [44]

    minif2f: a cross-system benchmark for formal olympiad-level mathematics

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv

  24. [47]

    Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026

    Nikil Ravi, Langston Nashold, and Rayan Krishnan. Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026. URL https://www.vals.ai/ benchmarks/proof_bench. Accessed: 2026-04-29

  25. [50]

    Aristotle: IMO-level Automated Theorem Proving

    Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and La...

  26. [56]

    Sciencebench: Project benchmarks

    Christian Stump. Sciencebench: Project benchmarks. https://math.sciencebench.ai/ benchmarks, March 2026. Published March 23, 2026. Accessed April 8, 2026

  27. [64]

    Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J. Bell. Abstentionbench: Reasoning llms fail on unanswerable questions.CoRR, abs/2506.09038, 2025. doi: 10.48550/ ARXIV .2506.09038. URLhttps://doi.org/10.48550/arXiv.2506.09038

  28. [65]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

  29. [66]

    URLhttps://openreview.net/forum?id=KUNzEQMWU7

  30. [67]

    Charxiv: Charting gaps in realistic chart understanding in multimodal llms

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. In Amir Globersons, Lester Mackey, Danielle Belgrave, An- gela Fan, Ulrich Paquet, Jakub M. Tomczak, and Ch...

  31. [68]

    Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/forum?id=V...

  32. [73]

    Project euler, 2025

    Project Euler. Project euler, 2025. URLhttps://projecteuler.net/. Accessed: 2025

  33. [75]

    American Invitational Mathematics Examination

    Mathematical Association of America. American Invitational Mathematics Examination. https://maa.org/maa-invitational-competitions/, 2026. Accessed: 2026-04-09

  34. [76]

    Harvard-mit mathematics tournament, 2025

    HMMT. Harvard-mit mathematics tournament, 2025. URL https://www.hmmt.org/. Ac- cessed: 2025

  35. [77]

    Vision language models are blind

    Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, and Hongbin Zha, editors,Computer Vision - ACCV 2024 - 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8-12, 2024, Proceedings, Part V, Lecture Notes in Computer Science, p...

  36. [78]

    Kangourou sans Frontières

    Kangourou sans Frontières. Kangourou sans Frontières. https://www.aksf.org/, 2026. Accessed: 2026-04-09

  37. [79]

    Miklós Schweitzer Memorial Competition

    Bolyai János Mathematical Society. Miklós Schweitzer Memorial Competition. https: //www.bolyai.hu/versenyek-schweitzer-miklos-emlekverseny/ , 2026. Accessed: 2026- 04-09

  38. [80]

    William Lowell Putnam Mathematical Competition

    Mathematical Association of America. William Lowell Putnam Mathematical Competition. https://maa.org/putnam/, 2026. Accessed: 2026-04-09

  39. [81]

    International Mathematics Competition for University Students

    International Mathematics Competition for University Students. International Mathematics Competition for University Students. https://www.imc-math.org.uk/, 2026. Accessed: 2026-04-09

  40. [83]

    comparator, 2026

    leanprover. comparator, 2026. URL https://github.com/leanprover/comparator. GitHub repository, retrieved 2026-04-17

  41. [84]

    Openai o3 and o4-mini system card

    OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, April 2025. URL https://openai.com/index/o3-o4-mini-system-card/. Accessed: 2026-04-09

  42. [86]

    Openai o3-mini system card

    OpenAI. Openai o3-mini system card. Technical report, OpenAI, January 2025. URL https://openai.com/index/o3-mini-system-card/. Accessed: 2026-04-09

  43. [88]

    Update to gpt-5 system card: Gpt-5.2

    OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, OpenAI, Decem- ber 2025. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/ . Ac- cessed: 2026-04-09

  44. [89]

    Gpt-5.4 thinking system card

    OpenAI. Gpt-5.4 thinking system card. Technical report, OpenAI, March 2026. URL https://openai.com/index/gpt-5-4-thinking-system-card/. Accessed: 2026-04-09

  45. [90]

    Grok 4.xAI News, July 2025

    xAI. Grok 4.xAI News, July 2025. URL https://x.ai/news/grok-4. Accessed: 2026-04-09

  46. [91]

    Gemini 3 pro - model card

    Google. Gemini 3 pro - model card. Technical report, Google, Decem- ber 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf. Accessed: 2026-04-09

  47. [92]

    Gemini 3.1 pro model card

    Google. Gemini 3.1 pro model card. Technical report, Google, Febru- ary 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Pro-Model-Card.pdf. Accessed: 2026-04-09

  48. [95]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

  49. [97]

    Marah I Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat S. Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safo...

  50. [100]

    Gemini 2.5 deep think model card

    Google. Gemini 2.5 deep think model card. Technical report, Google, Au- gust 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-2-5-Deep-Think-Model-Card.pdf. Accessed: 2026-04-09

  51. [101]

    System card: Claude mythos preview

    Anthropic. System card: Claude mythos preview. Technical report, Anthropic, April 2026. URL https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841. pdf. Accessed: 2026-04-30

  52. [102]

    Smith, Mateusz Paprocki, Ondrej Certik, Sergey B

    Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondrej Certík, Sergey B. Kirpichev, Matthew Rocklin, Amit Kumar, Sergiu Ivanov, Jason Keith Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Stepán ...

  53. [106]

    System card: Claude opus 4.6

    Anthropic. System card: Claude opus 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf. Accessed: 2026-04-09

  54. [107]

    The lean mathematical library

    The mathlib Community. The lean mathematical library. In Jasmin Blanchette and Catalin Hritcu, editors,Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, New Orleans, LA, USA, January 20-21, 2020, pages 367–

  55. [108]

    The Lean mathematical library

    ACM, 2020. doi: 10.1145/3372885.3373824. URLhttps://doi.org/10.1145/3372885. 3373824

  56. [109]

    AXLE: Axiom Lean Engine

    Axiom Math. AXLE: Axiom Lean Engine. https://axle.axiommath.ai/, 2026. Accessed: 2026-04-10

  57. [110]

    loogle: Mathlib search tool

    Joachim Breitner. loogle: Mathlib search tool. https://github.com/nomeata/loogle, 2026. GitHub repository. Accessed: 2026-04-10

  58. [111]

    Leanexplore: A search engine for lean 4 declarations.CoRR, abs/2506.11085,

    Justin Asher. Leanexplore: A search engine for lean 4 declarations.CoRR, abs/2506.11085,

  59. [113]

    Gemini 3 flash model card

    Google. Gemini 3 flash model card. Technical report, Google, Decem- ber 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Flash-Model-Card.pdf. Accessed: 2026-04-09

  60. [115]

    Dickerson

    Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan S. Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Recursive self-aggregation unlocks deep thinking in large language models.CoRR, abs/2509.26626, 2025. doi: 10.48550/ARXIV .2509.26626. URL https://doi.o...

  61. [116]

    Try to prove the following statement: {perturbed_statement}

    Roger Jin, Jeffrey Quesnelle, Dakota Mahan, Chen Guang, Ryan Teknium, Jun Park, Ibrakhim Ustelbay, Samuel Kim, Miron Yurkevich, Adilet Zauytkhan, Rinat Amankos, Alex Andreyev, Damir Nurlanov, Abuzer Abuov, and Askar massiveaxe. Nomos. https://github.com/ NousResearch/nomos, 2025. GitHub repository. 18 A Benchmark Details Table 4: Benchmarks included in MA...

  62. [117]

    Chain A: [idea] / Chain B: [idea] /

    **Checkpoints (7 pts total)** * Break the solution into logically independent checkpoints with **integer** point values. * Allocate **>= 4 pts** to the main idea/critical steps (all combined); **<= 3 pts** to routine work (all combined).,→ * If a specific checkpoint can be achieved in multiple mathematically equivalent ways, mention this. However, only do...

  63. [118]

    **Zero-credit items** * List common arguments or observations that **earn 0 points** (e.g., conjectures without proof, especially in geometry, routine restatements of the problem, or dead-ends).,→ * Avoid redundancy with deductions below

  64. [119]

    cap at x/7

    **Deductions** * Bullet each typical/expected mistake with a **flat** penalty (**1**, **2**, or **"cap at x/7"**). Use caps for major problems, while subtraction should be used for smaller arithmetic errors. Try to anticipate likely errors based on the official solution structure. ,→ ,→ * Target logic gaps, invalid claims, circular reasoning, or contradic...

  65. [120]

    **Mathematical validity** of the proof's reasoning and conclusion

  66. [121]

    **Problem constraints** (e.g., unique required final value; forbidden tools if stated)

  67. [122]

    brilliant

    **Advisory mapping to the marking scheme** (checkpoints/deductions), allowing different orders and techniques. **Alternative-approach policy:** - If the proof uses a different but valid method, **map its steps to equivalent rubric checkpoints** (same logical role) and award points accordingly.,→ - Apply zero-credit items/deductions **only when the underly...

  68. [123]

    action":

    specific error 2, ... </errors> ### INPUT DATA ## Problem Statement ## {problem_statement} ## Marking Scheme ## {guidelines} ## Proof Solution ## {student_answer} Judge Proof Reconciliation {judge_prompt} You are additionally given the following judgments from other judges on the same proof. They have all read the same proof and marking scheme, but they s...

  69. [124]

    **Direct derivability** The answer must be derivable *directly and unambiguously* from the abstract alone, without requiring access to the full paper or external references.,→

  70. [125]

    **Main contribution** The question must target a *primary theorem, result, or quantitative claim* of the paper, not background material, motivation, or related work.,→

  71. [126]

    **Unambiguous and objective** The question must have exactly **one correct answer**, with no dependence on interpretation, conventions, or unstated assumptions.,→

  72. [127]

    significant,

    **Non-subjective** The question must not involve opinions, qualitative judgments, or vague descriptors (e.g., "significant," "large," "efficient").,→

  73. [128]

    Additionally, avoid logical expressions and inequalities

    **Answer format constraint** The answer must be **either**: - a single numerical value, or - a pure LaTeX mathematical expression The answer **must not contain any English words**, including within LaTeX (symbols and variables are allowed). Additionally, avoid logical expressions and inequalities. Focus on functions, constants, formulas, or specific mathe...

  74. [129]

    **Question type restriction** The question must **not** be: - yes/no - multiple-choice - a request to prove or explain something

  75. [130]

    **Machine-verifiable** The answer must be suitable for **rule-based verification**, meaning it can be extracted and compared as a string or parsed LaTeX expression.,→

  76. [131]

    in this work

    **Self-contained** The question must be understandable *on its own*. - Do **not** reference the paper, authors, or phrases like "in this work." - All notation and quantities used must be explicitly defined in the question

  77. [132]

    **No paper references in the answer** The answer must be a standalone mathematical object and must not refer to the paper, its results, or its statements.,→

  78. [133]

    **Claim needs to be proven** The authors must say they have actually proven or established the claim in the paper, not just stated it as a conjecture or open problem.,→

  79. [134]

    In particular, all variables, notation, and quantities used in the question must be explicitly defined within the question itself

    **All context provided** Ensure the question contains all necessary context from the abstract to be answerable. In particular, all variables, notation, and quantities used in the question must be explicitly defined within the question itself. ,→ ,→ It is okay if questions are long, as long as they remain clear and unambiguous. 43

  80. [135]

    stated result

    **Be careful with bounds** Some papers prove bounds or inequalities. These are acceptable only if the bound is stated to be tight or exact in the abstract, so that there is a unique correct answer.,→ Otherwise, such abstracts should be rejected. --- ## Examples of Unacceptable Questions - A question that is very easy, and clearly not the main contribution...

Showing first 80 references.