Recognition: unknown
Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
Pith reviewed 2026-05-09 19:15 UTC · model grok-4.3
The pith
MathArena turns static math benchmarks into a continuously updated platform that adds new tasks to track LLM reasoning as capabilities advance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MathArena expands from final-answer olympiad problems into a maintained platform covering proof-based competitions, research-level arXiv problems, and Lean formal proof generation, with a fixed evaluation protocol and ongoing addition of new benchmarks to keep the suite challenging. The strongest model achieves 98 percent on the 2026 USA Math Olympiad and 74 percent on research-level questions, establishing that frontier LLMs can now solve extremely challenging mathematical problems across informal and formal settings.
What carries the argument
MathArena, a continuously maintained evaluation platform that runs, aggregates, and analyzes LLM performance across a growing set of mathematical tasks with a fixed protocol and periodic addition of new problems to prevent saturation.
If this is right
- Models can be compared consistently over time even after individual benchmarks are solved.
- Regular addition of harder tasks keeps the evaluation relevant to current frontier capabilities.
- High scores on olympiad and research problems support using LLMs as collaborators on difficult mathematics.
- Including both informal and formal tasks allows tracking progress across different styles of mathematical work.
Where Pith is reading between the lines
- Widespread adoption of such platforms could shift focus from creating one-off benchmarks to long-term maintenance of evaluation suites.
- The data may help identify persistent failure modes, such as specific proof techniques or research domains where models still lag.
- Extending the platform to additional formal systems beyond Lean could test whether current high performance generalizes across proof assistants.
Load-bearing premise
The new tasks and evaluation protocol stay fair and free of training-data contamination as models scale and the platform adds more problems.
What would settle it
Detection of training-data overlap in the research-level arXiv problems or a sudden drop in all models' scores on a newly added benchmark would indicate the reported performance no longer reflects genuine reasoning ability.
Figures
read the original abstract
Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MathArena as an expanded, continuously maintained evaluation platform for LLM mathematical reasoning. It broadens the original olympiad-focused benchmark to encompass proof-based competitions, research-level arXiv problems, and Lean formal proof generation, while stressing the need for regularly updated tasks to avoid saturation of static benchmarks. The work reports that the strongest evaluated model, GPT-5.5, achieves 98% on the 2026 USA Math Olympiad and 74% on research-level questions.
Significance. If the underlying evaluation protocol and benchmark construction can be shown to be robust against contamination, the platform would usefully illustrate the rapid progress of frontier LLMs on advanced mathematics and provide a model for dynamic, domain-specific evaluation systems. The emphasis on ongoing benchmark renewal addresses a genuine limitation of fixed test sets.
major comments (2)
- [Abstract] Abstract: The headline performance claims (98% on 2026 USAMO, 74% on research-level questions) are stated without any description of evaluation methodology, data sources, problem provenance, answer verification procedures, or contamination checks. These details are load-bearing for the central assertion that frontier models now 'comfortably solve extremely challenging mathematical problems.'
- [Platform and protocol sections] Platform description: The manuscript asserts that a 'clear evaluation protocol' is maintained and that new benchmarks are 'regularly designed as model capabilities improve,' yet supplies no concrete mechanisms (e.g., post-cutoff problem generation dates, n-gram or embedding overlap checks against training corpora, or standardized prompting and scoring rules for open-ended proofs).
minor comments (1)
- A summary table listing task categories, number of problems per category, and evaluation metrics would help readers quickly grasp the platform's breadth and the basis for the aggregated performance figures.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and robustness of our presentation. We have revised the manuscript to address the concerns about methodological details in the abstract and platform sections. Below we respond point by point.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline performance claims (98% on 2026 USAMO, 74% on research-level questions) are stated without any description of evaluation methodology, data sources, problem provenance, answer verification procedures, or contamination checks. These details are load-bearing for the central assertion that frontier models now 'comfortably solve extremely challenging mathematical problems.'
Authors: We agree that the abstract's brevity omits these supporting details, which are central to the claims. The full manuscript describes the sources and protocol in later sections, but we have now revised the abstract to include a concise summary: problems are drawn from official post-2024 competition archives and recent arXiv submissions, verified via automated solvers plus expert review, and screened for contamination through date-based cutoffs and overlap checks. This revision strengthens the abstract without altering its length constraints. revision: yes
-
Referee: [Platform and protocol sections] Platform description: The manuscript asserts that a 'clear evaluation protocol' is maintained and that new benchmarks are 'regularly designed as model capabilities improve,' yet supplies no concrete mechanisms (e.g., post-cutoff problem generation dates, n-gram or embedding overlap checks against training corpora, or standardized prompting and scoring rules for open-ended proofs).
Authors: We acknowledge that the original text referenced the protocol at a high level without enumerating the concrete mechanisms. In the revised manuscript we have expanded the Platform and Evaluation Protocol sections to specify: all problems use post-cutoff dates (2025 onward for competitions and arXiv papers after model training windows); contamination detection employs n-gram overlap thresholds and embedding cosine similarity against public training corpora; prompting is standardized with fixed chain-of-thought templates and few-shot examples; and scoring combines Lean verifiers for formal proofs with rubric-based expert grading for open-ended work. We also detail the saturation-based process for introducing new benchmarks. revision: yes
Circularity Check
No circularity: descriptive platform paper with no derivations, equations, or self-referential predictions
full rationale
The paper is a descriptive account of expanding the MathArena evaluation platform to cover proof-based competitions, research-level arXiv problems, and Lean formal proofs, while maintaining an evaluation protocol and adding new benchmarks. It reports observed performance numbers (e.g., GPT-5.5 at 98% on 2026 USAMO) without any derivation chain, fitted parameters, predictions, or equations. No load-bearing self-citations reduce claims to tautologies; the reference to the original MathArena is contextual background, not a premise that forces the reported results. The concern about benchmark contamination is an external validity issue, not a circular reduction in the paper's own logic. No enumerated circularity patterns apply.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 5 Pith papers
-
CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency
CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.
-
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
-
BFLA: Block-Filtered Long-Context Attention Mechanism
BFLA is a two-stage block-filtered sparse prefill attention mechanism that constructs an input-dependent block mask and applies tile-level rescues to skip unimportant KV tiles while preserving exact attention inside r...
Reference graph
Works this paper leans on
-
[6]
Jordan, Joseph E
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett,...
2024
-
[7]
Ai model & api providers analysis
Artificial Analysis. Ai model & api providers analysis. https://artificialanalysis.ai/,
-
[8]
Accessed: 2026-04-13
2026
-
[9]
Benchmarks.https://www.vals.ai/benchmarks, 2026
Vals AI. Benchmarks.https://www.vals.ai/benchmarks, 2026. Accessed: 2026-04-13
2026
-
[10]
Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovic, and Martin T. Vechev. Matharena: Evaluating llms on uncontaminated math competitions.CoRR, abs/2505.23281,
work page internal anchor Pith review arXiv
-
[12]
The lean theorem prover (system description)
Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The lean theorem prover (system description). In Amy P. Felty and Aart Middeldorp, editors,Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings, Lecture Notes in Computer Science, p...
-
[13]
Gpt-5.5 system card
OpenAI. Gpt-5.5 system card. Technical report, OpenAI, April 2026. URL https://openai. com/index/gpt-5-5-system-card/. Accessed: 2026-04-27
2026
-
[14]
Deepseek-v4: Towards highly efficient million-token context intelligence
DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. Technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/ blob/main/DeepSeek_V4.pdf. Released April 24, 2026. Accessed: 2026-04-29
2026
-
[15]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021. URLhttps://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
Measuring mathematical problem solv- ing with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solv- ing with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2...
2021
-
[17]
URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html
2021
-
[18]
Omni-math: A universal olympiad level mathematic benchmark for large language models
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Cheng- hao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shang- haoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models. InThe Thirtee...
2025
-
[19]
Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceed...
2024
-
[20]
Association for Computational Linguistics, 2024. doi: 10.18653/V1/2024.ACL-LONG
-
[21]
URLhttps://doi.org/10.18653/v1/2024.acl-long.211
-
[22]
Ugmathbench: A diverse and dynamic benchmark for undergraduate-level mathematical reasoning with large language models
Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, and Can Yang. Ugmathbench: A diverse and dynamic benchmark for undergraduate-level mathematical reasoning with large language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/fo...
2025
-
[23]
Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P
Jingxuan Fan, Sarah Martinson, Erik Y . Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P. Brenner. Hardmath: A benchmark dataset for challenging problems in applied mathematics. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. UR...
2025
-
[30]
Mathematical capabilities of chatgpt
Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...
2023
-
[32]
Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V . Le, and Junehyuk Jung. Towards robust mathematical reasoning. In Christos Christodoulopou...
-
[37]
Chawla, and Xiangliang Zhang
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V . Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,
2025
-
[38]
URLhttps://openreview.net/forum?id=3GTtZFiajM
-
[42]
Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition
George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jen- nings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, edi- tors,Advances in ...
2024
-
[44]
minif2f: a cross-system benchmark for formal olympiad-level mathematics
Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv
2022
-
[47]
Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026
Nikil Ravi, Langston Nashold, and Rayan Krishnan. Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026. URL https://www.vals.ai/ benchmarks/proof_bench. Accessed: 2026-04-29
2026
-
[50]
Aristotle: Imo-level automated theorem proving.arXiv preprint arXiv:2510.01346,
Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and La...
-
[56]
Sciencebench: Project benchmarks
Christian Stump. Sciencebench: Project benchmarks. https://math.sciencebench.ai/ benchmarks, March 2026. Published March 23, 2026. Accessed April 8, 2026
2026
-
[64]
Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J. Bell. Abstentionbench: Reasoning llms fail on unanswerable questions.CoRR, abs/2506.09038, 2025. doi: 10.48550/ ARXIV .2506.09038. URLhttps://doi.org/10.48550/arXiv.2506.09038
-
[65]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,
2024
-
[66]
URLhttps://openreview.net/forum?id=KUNzEQMWU7
-
[67]
Charxiv: Charting gaps in realistic chart understanding in multimodal llms
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. In Amir Globersons, Lester Mackey, Danielle Belgrave, An- gela Fan, Ulrich Paquet, Jakub M. Tomczak, and Ch...
2024
-
[68]
Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models
Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/forum?id=V...
2025
-
[73]
Project euler, 2025
Project Euler. Project euler, 2025. URLhttps://projecteuler.net/. Accessed: 2025
2025
-
[75]
American Invitational Mathematics Examination
Mathematical Association of America. American Invitational Mathematics Examination. https://maa.org/maa-invitational-competitions/, 2026. Accessed: 2026-04-09
2026
-
[76]
Harvard-mit mathematics tournament, 2025
HMMT. Harvard-mit mathematics tournament, 2025. URL https://www.hmmt.org/. Ac- cessed: 2025
2025
-
[77]
Vision language models are blind
Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, and Hongbin Zha, editors,Computer Vision - ACCV 2024 - 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8-12, 2024, Proceedings, Part V, Lecture Notes in Computer Science, p...
-
[78]
Kangourou sans Frontières
Kangourou sans Frontières. Kangourou sans Frontières. https://www.aksf.org/, 2026. Accessed: 2026-04-09
2026
-
[79]
Miklós Schweitzer Memorial Competition
Bolyai János Mathematical Society. Miklós Schweitzer Memorial Competition. https: //www.bolyai.hu/versenyek-schweitzer-miklos-emlekverseny/ , 2026. Accessed: 2026- 04-09
2026
-
[80]
William Lowell Putnam Mathematical Competition
Mathematical Association of America. William Lowell Putnam Mathematical Competition. https://maa.org/putnam/, 2026. Accessed: 2026-04-09
2026
-
[81]
International Mathematics Competition for University Students
International Mathematics Competition for University Students. International Mathematics Competition for University Students. https://www.imc-math.org.uk/, 2026. Accessed: 2026-04-09
2026
-
[83]
comparator, 2026
leanprover. comparator, 2026. URL https://github.com/leanprover/comparator. GitHub repository, retrieved 2026-04-17
2026
-
[84]
Openai o3 and o4-mini system card
OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, April 2025. URL https://openai.com/index/o3-o4-mini-system-card/. Accessed: 2026-04-09
2025
-
[86]
Openai o3-mini system card
OpenAI. Openai o3-mini system card. Technical report, OpenAI, January 2025. URL https://openai.com/index/o3-mini-system-card/. Accessed: 2026-04-09
2025
-
[88]
Update to gpt-5 system card: Gpt-5.2
OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, OpenAI, Decem- ber 2025. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/ . Ac- cessed: 2026-04-09
2025
-
[89]
Gpt-5.4 thinking system card
OpenAI. Gpt-5.4 thinking system card. Technical report, OpenAI, March 2026. URL https://openai.com/index/gpt-5-4-thinking-system-card/. Accessed: 2026-04-09
2026
-
[90]
Grok 4.xAI News, July 2025
xAI. Grok 4.xAI News, July 2025. URL https://x.ai/news/grok-4. Accessed: 2026-04-09
2025
-
[91]
Gemini 3 pro - model card
Google. Gemini 3 pro - model card. Technical report, Google, Decem- ber 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf. Accessed: 2026-04-09
2025
-
[92]
Gemini 3.1 pro model card
Google. Gemini 3.1 pro model card. Technical report, Google, Febru- ary 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Pro-Model-Card.pdf. Accessed: 2026-04-09
2026
-
[95]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5
2026
-
[97]
Marah I Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat S. Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safo...
-
[100]
Gemini 2.5 deep think model card
Google. Gemini 2.5 deep think model card. Technical report, Google, Au- gust 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-2-5-Deep-Think-Model-Card.pdf. Accessed: 2026-04-09
2025
-
[101]
System card: Claude mythos preview
Anthropic. System card: Claude mythos preview. Technical report, Anthropic, April 2026. URL https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841. pdf. Accessed: 2026-04-30
2026
-
[102]
Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondrej Certík, Sergey B. Kirpichev, Matthew Rocklin, Amit Kumar, Sergiu Ivanov, Jason Keith Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Stepán ...
-
[106]
System card: Claude opus 4.6
Anthropic. System card: Claude opus 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf. Accessed: 2026-04-09
2026
-
[107]
The lean mathematical library
The mathlib Community. The lean mathematical library. In Jasmin Blanchette and Catalin Hritcu, editors,Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, New Orleans, LA, USA, January 20-21, 2020, pages 367–
2020
-
[108]
In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs
ACM, 2020. doi: 10.1145/3372885.3373824. URLhttps://doi.org/10.1145/3372885. 3373824
-
[109]
AXLE: Axiom Lean Engine
Axiom Math. AXLE: Axiom Lean Engine. https://axle.axiommath.ai/, 2026. Accessed: 2026-04-10
2026
-
[110]
loogle: Mathlib search tool
Joachim Breitner. loogle: Mathlib search tool. https://github.com/nomeata/loogle, 2026. GitHub repository. Accessed: 2026-04-10
2026
-
[111]
Leanexplore: A search engine for lean 4 declarations.arXiv preprint arXiv:2506.11085,
Justin Asher. Leanexplore: A search engine for lean 4 declarations.CoRR, abs/2506.11085,
-
[113]
Gemini 3 flash model card
Google. Gemini 3 flash model card. Technical report, Google, Decem- ber 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Flash-Model-Card.pdf. Accessed: 2026-04-09
2025
-
[115]
Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan S. Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Recursive self-aggregation unlocks deep thinking in large language models.CoRR, abs/2509.26626, 2025. doi: 10.48550/ARXIV .2509.26626. URL https://doi.o...
work page internal anchor Pith review doi:10.48550/arxiv 2025
-
[116]
Try to prove the following statement: {perturbed_statement}
Roger Jin, Jeffrey Quesnelle, Dakota Mahan, Chen Guang, Ryan Teknium, Jun Park, Ibrakhim Ustelbay, Samuel Kim, Miron Yurkevich, Adilet Zauytkhan, Rinat Amankos, Alex Andreyev, Damir Nurlanov, Abuzer Abuov, and Askar massiveaxe. Nomos. https://github.com/ NousResearch/nomos, 2025. GitHub repository. 18 A Benchmark Details Table 4: Benchmarks included in MA...
2025
-
[117]
Chain A: [idea] / Chain B: [idea] /
**Checkpoints (7 pts total)** * Break the solution into logically independent checkpoints with **integer** point values. * Allocate **>= 4 pts** to the main idea/critical steps (all combined); **<= 3 pts** to routine work (all combined).,→ * If a specific checkpoint can be achieved in multiple mathematically equivalent ways, mention this. However, only do...
-
[118]
**Zero-credit items** * List common arguments or observations that **earn 0 points** (e.g., conjectures without proof, especially in geometry, routine restatements of the problem, or dead-ends).,→ * Avoid redundancy with deductions below
-
[119]
cap at x/7
**Deductions** * Bullet each typical/expected mistake with a **flat** penalty (**1**, **2**, or **"cap at x/7"**). Use caps for major problems, while subtraction should be used for smaller arithmetic errors. Try to anticipate likely errors based on the official solution structure. ,→ ,→ * Target logic gaps, invalid claims, circular reasoning, or contradic...
-
[120]
**Mathematical validity** of the proof's reasoning and conclusion
-
[121]
**Problem constraints** (e.g., unique required final value; forbidden tools if stated)
-
[122]
brilliant
**Advisory mapping to the marking scheme** (checkpoints/deductions), allowing different orders and techniques. **Alternative-approach policy:** - If the proof uses a different but valid method, **map its steps to equivalent rubric checkpoints** (same logical role) and award points accordingly.,→ - Apply zero-credit items/deductions **only when the underly...
-
[123]
action":
specific error 2, ... </errors> ### INPUT DATA ## Problem Statement ## {problem_statement} ## Marking Scheme ## {guidelines} ## Proof Solution ## {student_answer} Judge Proof Reconciliation {judge_prompt} You are additionally given the following judgments from other judges on the same proof. They have all read the same proof and marking scheme, but they s...
-
[124]
**Direct derivability** The answer must be derivable *directly and unambiguously* from the abstract alone, without requiring access to the full paper or external references.,→
-
[125]
**Main contribution** The question must target a *primary theorem, result, or quantitative claim* of the paper, not background material, motivation, or related work.,→
-
[126]
**Unambiguous and objective** The question must have exactly **one correct answer**, with no dependence on interpretation, conventions, or unstated assumptions.,→
-
[127]
significant,
**Non-subjective** The question must not involve opinions, qualitative judgments, or vague descriptors (e.g., "significant," "large," "efficient").,→
-
[128]
Additionally, avoid logical expressions and inequalities
**Answer format constraint** The answer must be **either**: - a single numerical value, or - a pure LaTeX mathematical expression The answer **must not contain any English words**, including within LaTeX (symbols and variables are allowed). Additionally, avoid logical expressions and inequalities. Focus on functions, constants, formulas, or specific mathe...
-
[129]
**Question type restriction** The question must **not** be: - yes/no - multiple-choice - a request to prove or explain something
-
[130]
**Machine-verifiable** The answer must be suitable for **rule-based verification**, meaning it can be extracted and compared as a string or parsed LaTeX expression.,→
-
[131]
in this work
**Self-contained** The question must be understandable *on its own*. - Do **not** reference the paper, authors, or phrases like "in this work." - All notation and quantities used must be explicitly defined in the question
-
[132]
**No paper references in the answer** The answer must be a standalone mathematical object and must not refer to the paper, its results, or its statements.,→
-
[133]
**Claim needs to be proven** The authors must say they have actually proven or established the claim in the paper, not just stated it as a conjecture or open problem.,→
-
[134]
In particular, all variables, notation, and quantities used in the question must be explicitly defined within the question itself
**All context provided** Ensure the question contains all necessary context from the abstract to be answerable. In particular, all variables, notation, and quantities used in the question must be explicitly defined within the question itself. ,→ ,→ It is okay if questions are long, as long as they remain clear and unambiguous. 43
-
[135]
stated result
**Be careful with bounds** Some papers prove bounds or inequalities. These are acceptable only if the bound is stated to be tight or exact in the abstract, so that there is a unique correct answer.,→ Otherwise, such abstracts should be rejected. --- ## Examples of Unacceptable Questions - A question that is very easy, and clearly not the main contribution...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.