Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
Pith reviewed 2026-05-19 18:14 UTC · model grok-4.3
The pith
Expanding MathArena to proofs and research questions shows frontier LLMs solve 98% of 2026 USAMO problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MathArena has been extended from final-answer olympiad problems to a broad, maintained platform covering proof-based competitions, research-level arXiv problems, and formal proof generation in Lean, with a fixed evaluation protocol and regular addition of new benchmarks to match improving capabilities, under which the strongest model GPT-5.5 reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions.
What carries the argument
MathArena, a continuously updated evaluation platform that aggregates results across olympiad, proof, research, and formal tasks while adding new problems to stay challenging.
If this is right
- Models can be compared reliably across a wider variety of mathematical tasks including proofs and research questions.
- Progress in LLM mathematical abilities can be tracked over time through regular benchmark updates.
- Evaluation stays relevant as capabilities grow because new problems are introduced when old ones saturate.
- High accuracy on research-level questions indicates LLMs may soon assist with advanced mathematical work.
Where Pith is reading between the lines
- Similar maintained platforms could be built for physics or computer science to monitor AI capabilities in those areas.
- The focus on new benchmarks suggests a community need for private or freshly created problems to avoid contamination.
- If performance trends continue, LLMs might contribute to solving open research mathematics problems within a few years.
Load-bearing premise
The benchmarks and protocol remain free of training-data contamination and measure genuine reasoning rather than memorization or shortcuts.
What would settle it
A model achieving high scores on MathArena but low scores on a fresh set of never-published math problems of similar difficulty would indicate contamination or overfitting instead of reasoning ability.
Figures
read the original abstract
Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MathArena as an expanded, continuously maintained evaluation platform for LLM mathematical reasoning. It broadens the original olympiad-focused benchmark to encompass proof-based competitions, research-level arXiv problems, and formal proof generation in Lean, while maintaining a fixed evaluation protocol and introducing new benchmarks as capabilities advance. The central empirical claim is that GPT-5.5 reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions.
Significance. If the reported performance numbers rest on genuinely novel, uncontaminated problems that measure reasoning rather than memorization, the work would provide valuable evidence of rapid progress in frontier models' mathematical capabilities and would strengthen the case for dynamic, maintained evaluation platforms over static benchmarks.
major comments (2)
- [Abstract / Evaluation Protocol] Abstract and the section describing the evaluation protocol: the headline results (GPT-5.5 at 98% on 2026 USAMO and 74% on research-level arXiv problems) are presented as evidence that frontier models now solve extremely challenging problems, yet no concrete description is given of problem sourcing, leakage detection against common pre-training corpora, or statistical controls for benchmark-specific shortcuts; without these the numbers cannot be interpreted as genuine reasoning gains.
- [Benchmark Maintenance] The subsection on benchmark maintenance and new benchmark design: the claim that benchmarks are regularly updated 'as model capabilities improve' is central to the platform's value, but the manuscript supplies no explicit criteria, sourcing pipeline, or verification steps that would prevent post-hoc selection or distributional overlap with training data.
minor comments (1)
- [Results] Tables or figures reporting per-model, per-task accuracies would benefit from explicit error bars or sample-size information to allow readers to assess the stability of the 98% and 74% figures.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments correctly identify areas where the manuscript would benefit from greater transparency on problem sourcing, contamination controls, and benchmark maintenance procedures. We have revised the manuscript to incorporate these details and address the concerns point by point below.
read point-by-point responses
-
Referee: [Abstract / Evaluation Protocol] Abstract and the section describing the evaluation protocol: the headline results (GPT-5.5 at 98% on 2026 USAMO and 74% on research-level arXiv problems) are presented as evidence that frontier models now solve extremely challenging problems, yet no concrete description is given of problem sourcing, leakage detection against common pre-training corpora, or statistical controls for benchmark-specific shortcuts; without these the numbers cannot be interpreted as genuine reasoning gains.
Authors: We agree that the original manuscript provided insufficient detail on these aspects, which limits interpretability of the headline results. In the revised version we have expanded the evaluation protocol section with: (1) sourcing information stating that 2026 USAMO problems come from official post-2025 contest releases and research problems are drawn exclusively from arXiv preprints dated after the training cutoffs of all evaluated models; (2) a description of our leakage detection pipeline, which combines n-gram overlap checks against known public pre-training corpora with embedding-based semantic similarity searches; and (3) statistical controls including performance on a set of synthetically generated control problems designed to detect shortcut reliance. These additions allow readers to assess the degree to which results reflect reasoning rather than memorization. revision: yes
-
Referee: [Benchmark Maintenance] The subsection on benchmark maintenance and new benchmark design: the claim that benchmarks are regularly updated 'as model capabilities improve' is central to the platform's value, but the manuscript supplies no explicit criteria, sourcing pipeline, or verification steps that would prevent post-hoc selection or distributional overlap with training data.
Authors: We acknowledge the manuscript originally lacked explicit criteria and verification steps for benchmark maintenance. The revised subsection now specifies: new benchmarks are sourced only from competitions and arXiv papers released after the most recent model training cutoffs; each problem undergoes expert review for novelty and rigor; and distributional overlap is checked via cosine similarity on sentence embeddings against both training corpora and prior MathArena items. The sourcing pipeline is described as a combination of automated retrieval from official archives followed by manual curation to avoid post-hoc selection. These changes directly address the risk of contamination and selective reporting. revision: yes
Circularity Check
No circularity: purely empirical platform description with direct evaluations
full rationale
The paper presents MathArena as a continuously maintained evaluation platform that broadens an existing benchmark to include olympiad problems, research-level arXiv questions, and Lean proofs, while reporting direct model performance numbers such as GPT-5.5 reaching 98% on 2026 USAMO and 74% on research questions. No derivations, equations, fitted parameters, or self-referential reductions appear in the provided text; the results are presented as outcomes of running models on the described tasks under a maintained protocol. The work is therefore self-contained with no load-bearing steps that collapse to author-defined inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standardized evaluation protocols can produce comparable scores across different LLMs on mathematical tasks.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs... GPT-5.5 now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Final-answer competitions... Proof-based competitions... Research-level benchmarks (ArXivMath, BrokenArXiv, ArXivLean)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 9 Pith papers
-
CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency
CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.
-
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME...
-
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.
-
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
-
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
-
BFLA: Block-Filtered Long-Context Attention Mechanism
BFLA is a two-stage block-filtered sparse prefill attention mechanism that constructs an input-dependent block mask and applies tile-level rescues to skip unimportant KV tiles while preserving exact attention inside r...
-
A Survey of Audio Reasoning in Multimodal Foundation Models
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
Reference graph
Works this paper leans on
-
[6]
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett,...
work page 2024
-
[7]
Ai model & api providers analysis
Artificial Analysis. Ai model & api providers analysis. https://artificialanalysis.ai/,
-
[8]
Accessed: 2026-04-13
work page 2026
-
[9]
Benchmarks.https://www.vals.ai/benchmarks, 2026
Vals AI. Benchmarks.https://www.vals.ai/benchmarks, 2026. Accessed: 2026-04-13
work page 2026
-
[10]
Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovic, and Martin T. Vechev. Matharena: Evaluating llms on uncontaminated math competitions.CoRR, abs/2505.23281,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
The lean theorem prover (system description)
Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The lean theorem prover (system description). In Amy P. Felty and Aart Middeldorp, editors,Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings, Lecture Notes in Computer Science, p...
-
[13]
OpenAI. Gpt-5.5 system card. Technical report, OpenAI, April 2026. URL https://openai. com/index/gpt-5-5-system-card/. Accessed: 2026-04-27
work page 2026
-
[14]
Deepseek-v4: Towards highly efficient million-token context intelligence
DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. Technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/ blob/main/DeepSeek_V4.pdf. Released April 24, 2026. Accessed: 2026-04-29
work page 2026
-
[15]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021. URLhttps://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
Measuring mathematical problem solv- ing with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solv- ing with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2...
work page 2021
-
[17]
URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html
work page 2021
-
[18]
Omni-math: A universal olympiad level mathematic benchmark for large language models
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Cheng- hao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shang- haoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models. InThe Thirtee...
work page 2025
-
[19]
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceed...
work page 2024
-
[20]
doi: 10.18653/v1/2024.acl-long
Association for Computational Linguistics, 2024. doi: 10.18653/V1/2024.ACL-LONG
-
[21]
URLhttps://doi.org/10.18653/v1/2024.acl-long.211
-
[22]
Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, and Can Yang. Ugmathbench: A diverse and dynamic benchmark for undergraduate-level mathematical reasoning with large language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/fo...
work page 2025
-
[23]
Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P
Jingxuan Fan, Sarah Martinson, Erik Y . Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P. Brenner. Hardmath: A benchmark dataset for challenging problems in applied mathematics. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. UR...
work page 2025
-
[30]
Mathematical capabilities of chatgpt
Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...
work page 2023
-
[32]
Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V . Le, and Junehyuk Jung. Towards robust mathematical reasoning. In Christos Christodoulopou...
-
[37]
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V . Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,
work page 2025
-
[38]
URLhttps://openreview.net/forum?id=3GTtZFiajM
-
[42]
Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition
George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jen- nings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, edi- tors,Advances in ...
work page 2024
-
[44]
minif2f: a cross-system benchmark for formal olympiad-level mathematics
Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv
work page 2022
-
[47]
Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026
Nikil Ravi, Langston Nashold, and Rayan Krishnan. Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026. URL https://www.vals.ai/ benchmarks/proof_bench. Accessed: 2026-04-29
work page 2026
-
[50]
Aristotle: IMO-level Automated Theorem Proving
Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and La...
work page internal anchor Pith review Pith/arXiv arXiv
-
[56]
Sciencebench: Project benchmarks
Christian Stump. Sciencebench: Project benchmarks. https://math.sciencebench.ai/ benchmarks, March 2026. Published March 23, 2026. Accessed April 8, 2026
work page 2026
-
[64]
Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J. Bell. Abstentionbench: Reasoning llms fail on unanswerable questions.CoRR, abs/2506.09038, 2025. doi: 10.48550/ ARXIV .2506.09038. URLhttps://doi.org/10.48550/arXiv.2506.09038
-
[65]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,
work page 2024
-
[66]
URLhttps://openreview.net/forum?id=KUNzEQMWU7
-
[67]
Charxiv: Charting gaps in realistic chart understanding in multimodal llms
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. In Amir Globersons, Lester Mackey, Danielle Belgrave, An- gela Fan, Ulrich Paquet, Jakub M. Tomczak, and Ch...
work page 2024
-
[68]
Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/forum?id=V...
work page 2025
-
[73]
Project Euler. Project euler, 2025. URLhttps://projecteuler.net/. Accessed: 2025
work page 2025
-
[75]
American Invitational Mathematics Examination
Mathematical Association of America. American Invitational Mathematics Examination. https://maa.org/maa-invitational-competitions/, 2026. Accessed: 2026-04-09
work page 2026
-
[76]
Harvard-mit mathematics tournament, 2025
HMMT. Harvard-mit mathematics tournament, 2025. URL https://www.hmmt.org/. Ac- cessed: 2025
work page 2025
-
[77]
Vision language models are blind
Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, and Hongbin Zha, editors,Computer Vision - ACCV 2024 - 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8-12, 2024, Proceedings, Part V, Lecture Notes in Computer Science, p...
-
[78]
Kangourou sans Frontières. Kangourou sans Frontières. https://www.aksf.org/, 2026. Accessed: 2026-04-09
work page 2026
-
[79]
Miklós Schweitzer Memorial Competition
Bolyai János Mathematical Society. Miklós Schweitzer Memorial Competition. https: //www.bolyai.hu/versenyek-schweitzer-miklos-emlekverseny/ , 2026. Accessed: 2026- 04-09
work page 2026
-
[80]
William Lowell Putnam Mathematical Competition
Mathematical Association of America. William Lowell Putnam Mathematical Competition. https://maa.org/putnam/, 2026. Accessed: 2026-04-09
work page 2026
-
[81]
International Mathematics Competition for University Students
International Mathematics Competition for University Students. International Mathematics Competition for University Students. https://www.imc-math.org.uk/, 2026. Accessed: 2026-04-09
work page 2026
-
[83]
leanprover. comparator, 2026. URL https://github.com/leanprover/comparator. GitHub repository, retrieved 2026-04-17
work page 2026
-
[84]
Openai o3 and o4-mini system card
OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, April 2025. URL https://openai.com/index/o3-o4-mini-system-card/. Accessed: 2026-04-09
work page 2025
-
[86]
OpenAI. Openai o3-mini system card. Technical report, OpenAI, January 2025. URL https://openai.com/index/o3-mini-system-card/. Accessed: 2026-04-09
work page 2025
-
[88]
Update to gpt-5 system card: Gpt-5.2
OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, OpenAI, Decem- ber 2025. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/ . Ac- cessed: 2026-04-09
work page 2025
-
[89]
OpenAI. Gpt-5.4 thinking system card. Technical report, OpenAI, March 2026. URL https://openai.com/index/gpt-5-4-thinking-system-card/. Accessed: 2026-04-09
work page 2026
-
[90]
xAI. Grok 4.xAI News, July 2025. URL https://x.ai/news/grok-4. Accessed: 2026-04-09
work page 2025
-
[91]
Google. Gemini 3 pro - model card. Technical report, Google, Decem- ber 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf. Accessed: 2026-04-09
work page 2025
-
[92]
Google. Gemini 3.1 pro model card. Technical report, Google, Febru- ary 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Pro-Model-Card.pdf. Accessed: 2026-04-09
work page 2026
-
[95]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5
work page 2026
-
[97]
Marah I Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat S. Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safo...
work page internal anchor Pith review Pith/arXiv arXiv
-
[100]
Gemini 2.5 deep think model card
Google. Gemini 2.5 deep think model card. Technical report, Google, Au- gust 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-2-5-Deep-Think-Model-Card.pdf. Accessed: 2026-04-09
work page 2025
-
[101]
System card: Claude mythos preview
Anthropic. System card: Claude mythos preview. Technical report, Anthropic, April 2026. URL https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841. pdf. Accessed: 2026-04-30
work page 2026
-
[102]
Smith, Mateusz Paprocki, Ondrej Certik, Sergey B
Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondrej Certík, Sergey B. Kirpichev, Matthew Rocklin, Amit Kumar, Sergiu Ivanov, Jason Keith Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Stepán ...
-
[106]
Anthropic. System card: Claude opus 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf. Accessed: 2026-04-09
work page 2026
-
[107]
The mathlib Community. The lean mathematical library. In Jasmin Blanchette and Catalin Hritcu, editors,Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, New Orleans, LA, USA, January 20-21, 2020, pages 367–
work page 2020
-
[108]
ACM, 2020. doi: 10.1145/3372885.3373824. URLhttps://doi.org/10.1145/3372885. 3373824
-
[109]
Axiom Math. AXLE: Axiom Lean Engine. https://axle.axiommath.ai/, 2026. Accessed: 2026-04-10
work page 2026
-
[110]
Joachim Breitner. loogle: Mathlib search tool. https://github.com/nomeata/loogle, 2026. GitHub repository. Accessed: 2026-04-10
work page 2026
-
[111]
Leanexplore: A search engine for lean 4 declarations.CoRR, abs/2506.11085,
Justin Asher. Leanexplore: A search engine for lean 4 declarations.CoRR, abs/2506.11085,
-
[113]
Google. Gemini 3 flash model card. Technical report, Google, Decem- ber 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Flash-Model-Card.pdf. Accessed: 2026-04-09
work page 2025
-
[115]
Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan S. Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Recursive self-aggregation unlocks deep thinking in large language models.CoRR, abs/2509.26626, 2025. doi: 10.48550/ARXIV .2509.26626. URL https://doi.o...
work page internal anchor Pith review doi:10.48550/arxiv 2025
-
[116]
Try to prove the following statement: {perturbed_statement}
Roger Jin, Jeffrey Quesnelle, Dakota Mahan, Chen Guang, Ryan Teknium, Jun Park, Ibrakhim Ustelbay, Samuel Kim, Miron Yurkevich, Adilet Zauytkhan, Rinat Amankos, Alex Andreyev, Damir Nurlanov, Abuzer Abuov, and Askar massiveaxe. Nomos. https://github.com/ NousResearch/nomos, 2025. GitHub repository. 18 A Benchmark Details Table 4: Benchmarks included in MA...
work page 2025
-
[117]
Chain A: [idea] / Chain B: [idea] /
**Checkpoints (7 pts total)** * Break the solution into logically independent checkpoints with **integer** point values. * Allocate **>= 4 pts** to the main idea/critical steps (all combined); **<= 3 pts** to routine work (all combined).,→ * If a specific checkpoint can be achieved in multiple mathematically equivalent ways, mention this. However, only do...
-
[118]
**Zero-credit items** * List common arguments or observations that **earn 0 points** (e.g., conjectures without proof, especially in geometry, routine restatements of the problem, or dead-ends).,→ * Avoid redundancy with deductions below
-
[119]
**Deductions** * Bullet each typical/expected mistake with a **flat** penalty (**1**, **2**, or **"cap at x/7"**). Use caps for major problems, while subtraction should be used for smaller arithmetic errors. Try to anticipate likely errors based on the official solution structure. ,→ ,→ * Target logic gaps, invalid claims, circular reasoning, or contradic...
-
[120]
**Mathematical validity** of the proof's reasoning and conclusion
-
[121]
**Problem constraints** (e.g., unique required final value; forbidden tools if stated)
-
[122]
**Advisory mapping to the marking scheme** (checkpoints/deductions), allowing different orders and techniques. **Alternative-approach policy:** - If the proof uses a different but valid method, **map its steps to equivalent rubric checkpoints** (same logical role) and award points accordingly.,→ - Apply zero-credit items/deductions **only when the underly...
-
[123]
specific error 2, ... </errors> ### INPUT DATA ## Problem Statement ## {problem_statement} ## Marking Scheme ## {guidelines} ## Proof Solution ## {student_answer} Judge Proof Reconciliation {judge_prompt} You are additionally given the following judgments from other judges on the same proof. They have all read the same proof and marking scheme, but they s...
-
[124]
**Direct derivability** The answer must be derivable *directly and unambiguously* from the abstract alone, without requiring access to the full paper or external references.,→
-
[125]
**Main contribution** The question must target a *primary theorem, result, or quantitative claim* of the paper, not background material, motivation, or related work.,→
-
[126]
**Unambiguous and objective** The question must have exactly **one correct answer**, with no dependence on interpretation, conventions, or unstated assumptions.,→
-
[127]
**Non-subjective** The question must not involve opinions, qualitative judgments, or vague descriptors (e.g., "significant," "large," "efficient").,→
-
[128]
Additionally, avoid logical expressions and inequalities
**Answer format constraint** The answer must be **either**: - a single numerical value, or - a pure LaTeX mathematical expression The answer **must not contain any English words**, including within LaTeX (symbols and variables are allowed). Additionally, avoid logical expressions and inequalities. Focus on functions, constants, formulas, or specific mathe...
-
[129]
**Question type restriction** The question must **not** be: - yes/no - multiple-choice - a request to prove or explain something
-
[130]
**Machine-verifiable** The answer must be suitable for **rule-based verification**, meaning it can be extracted and compared as a string or parsed LaTeX expression.,→
-
[131]
**Self-contained** The question must be understandable *on its own*. - Do **not** reference the paper, authors, or phrases like "in this work." - All notation and quantities used must be explicitly defined in the question
-
[132]
**No paper references in the answer** The answer must be a standalone mathematical object and must not refer to the paper, its results, or its statements.,→
-
[133]
**Claim needs to be proven** The authors must say they have actually proven or established the claim in the paper, not just stated it as a conjecture or open problem.,→
-
[134]
**All context provided** Ensure the question contains all necessary context from the abstract to be answerable. In particular, all variables, notation, and quantities used in the question must be explicitly defined within the question itself. ,→ ,→ It is okay if questions are long, as long as they remain clear and unambiguous. 43
-
[135]
**Be careful with bounds** Some papers prove bounds or inequalities. These are acceptable only if the bound is stated to be tight or exact in the abstract, so that there is a unique correct answer.,→ Otherwise, such abstracts should be rejected. --- ## Examples of Unacceptable Questions - A question that is very easy, and clearly not the main contribution...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.