arxiv: 2605.00674 · v1 · submitted 2026-05-01 · 💻 cs.CL

Recognition: unknown

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Jasper Dekoninck , Nikola Jovanovi\'c , Tim Gehrunger , K\'ari R\"ognvalddson , Ivo Petrov , Chenhao Sun , Martin Vechev

Authors on Pith no claims yet

Pith reviewed 2026-05-09 19:15 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluationmathematical reasoningdynamic benchmarksolympiad problemsresearch-level mathformal proofsLeanevaluation platform

0 comments

The pith

MathArena turns static math benchmarks into a continuously updated platform that adds new tasks to track LLM reasoning as capabilities advance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Static benchmarks for LLM mathematical ability quickly saturate and become contaminated or unrepresentative as models improve. The authors build a maintained evaluation platform called MathArena that aggregates results across olympiad problems, proof competitions, research-level arXiv questions, and formal Lean proofs while regularly introducing harder tasks. This setup uses a consistent protocol to compare models fairly over time. The platform reports that the leading model reaches 98 percent accuracy on the 2026 USA Math Olympiad and 74 percent on research-level problems, indicating frontier systems now handle many extremely difficult mathematical tasks. The work shows why dynamic platforms are needed to measure genuine progress rather than benchmark overfitting.

Core claim

MathArena expands from final-answer olympiad problems into a maintained platform covering proof-based competitions, research-level arXiv problems, and Lean formal proof generation, with a fixed evaluation protocol and ongoing addition of new benchmarks to keep the suite challenging. The strongest model achieves 98 percent on the 2026 USA Math Olympiad and 74 percent on research-level questions, establishing that frontier LLMs can now solve extremely challenging mathematical problems across informal and formal settings.

What carries the argument

MathArena, a continuously maintained evaluation platform that runs, aggregates, and analyzes LLM performance across a growing set of mathematical tasks with a fixed protocol and periodic addition of new problems to prevent saturation.

If this is right

Models can be compared consistently over time even after individual benchmarks are solved.
Regular addition of harder tasks keeps the evaluation relevant to current frontier capabilities.
High scores on olympiad and research problems support using LLMs as collaborators on difficult mathematics.
Including both informal and formal tasks allows tracking progress across different styles of mathematical work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption of such platforms could shift focus from creating one-off benchmarks to long-term maintenance of evaluation suites.
The data may help identify persistent failure modes, such as specific proof techniques or research domains where models still lag.
Extending the platform to additional formal systems beyond Lean could test whether current high performance generalizes across proof assistants.

Load-bearing premise

The new tasks and evaluation protocol stay fair and free of training-data contamination as models scale and the platform adds more problems.

What would settle it

Detection of training-data overlap in the research-level arXiv problems or a sudden drop in all models' scores on a newly added benchmark would indicate the reported performance no longer reflects genuine reasoning ability.

Figures

Figures reproduced from arXiv: 2605.00674 by Chenhao Sun, Ivo Petrov, Jasper Dekoninck, K\'ari R\"ognvalddson, Martin Vechev, Nikola Jovanovi\'c, Tim Gehrunger.

**Figure 1.** Figure 1: Over the last year, MATHARENA has evolved through repeated benchmark launches (green), human grading (orange), interface improvements (blue), blog posts (purple), and new model evaluations (black). This lifecycle is what distinguishes an evaluation platform from a benchmark. MATHARENA: a platform for mathematical reasoning. The MATHARENA benchmark [9] introduced uncontaminated mathematics competitions as a… view at source ↗

**Figure 2.** Figure 2: Performance over time. Key takeaways. As shown in view at source ↗

**Figure 3.** Figure 3: Performance versus cost for all models with sufficient website coverage. Details on how view at source ↗

**Figure 4.** Figure 4: Absolute score gaps between the final human-validated USAMO 2026 grades and each automated grading setting. Automated proof evaluation is accurate. In view at source ↗

**Figure 5.** Figure 5: Leave-one-family-out rank robustness for the top 10 models under the full expected view at source ↗

**Figure 6.** Figure 6: Empirical coverage versus nominal level. Dashed is ideal calibration. 6K 10K 20K 50K 100K Average No. Tokens 15.0% 20.0% 25.0% 30.0% 35.0% Accuracy Agents DeepSeek Math Nomos RSA SelfCheck view at source ↗

**Figure 8.** Figure 8: The homepage presents the live leaderboard together with benchmark tabs, filters, and view at source ↗

**Figure 9.** Figure 9: The benchmark index page links each competition to scores, datasets, outputs, and short view at source ↗

**Figure 10.** Figure 10: The detailed leaderboard adds confidence intervals, token counts, retries, runtime, and view at source ↗

**Figure 11.** Figure 11: Each model page aggregates benchmark performance, expected rank, pricing, and view at source ↗

**Figure 12.** Figure 12: The comparison view places two models side by side to highlight differences in accuracy view at source ↗

**Figure 13.** Figure 13: Problem pages show the prompt, the verified answer, run-by-run outputs, and the full view at source ↗

**Figure 14.** Figure 14: The surprising-traces view surfaces high-loss failures that deviate strongly from a model’s view at source ↗

**Figure 15.** Figure 15: The blog page collects benchmark launches and qualitative analyses to document notable view at source ↗

read the original abstract

Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MathArena's expansion into proofs, research problems, and Lean formalization is a practical systems move for keeping math LLM evaluations current, but the headline performance numbers rest on thin evidence about data contamination.

read the letter

The core contribution here is turning the original MathArena into a maintained platform that now includes proof-based competitions, research-level arXiv problems, and formal proof generation in Lean, plus a commitment to add fresh tasks as models advance. That shift from static benchmark to ongoing system addresses a real issue: olympiad-style tests saturate quickly and stop distinguishing models. The reported results for GPT-5.5 at 98% on the 2026 USAMO set and 74% on research questions are the parts that stand out if they hold.

Referee Report

2 major / 1 minor

Summary. The paper presents MathArena as an expanded, continuously maintained evaluation platform for LLM mathematical reasoning. It broadens the original olympiad-focused benchmark to encompass proof-based competitions, research-level arXiv problems, and Lean formal proof generation, while stressing the need for regularly updated tasks to avoid saturation of static benchmarks. The work reports that the strongest evaluated model, GPT-5.5, achieves 98% on the 2026 USA Math Olympiad and 74% on research-level questions.

Significance. If the underlying evaluation protocol and benchmark construction can be shown to be robust against contamination, the platform would usefully illustrate the rapid progress of frontier LLMs on advanced mathematics and provide a model for dynamic, domain-specific evaluation systems. The emphasis on ongoing benchmark renewal addresses a genuine limitation of fixed test sets.

major comments (2)

[Abstract] Abstract: The headline performance claims (98% on 2026 USAMO, 74% on research-level questions) are stated without any description of evaluation methodology, data sources, problem provenance, answer verification procedures, or contamination checks. These details are load-bearing for the central assertion that frontier models now 'comfortably solve extremely challenging mathematical problems.'
[Platform and protocol sections] Platform description: The manuscript asserts that a 'clear evaluation protocol' is maintained and that new benchmarks are 'regularly designed as model capabilities improve,' yet supplies no concrete mechanisms (e.g., post-cutoff problem generation dates, n-gram or embedding overlap checks against training corpora, or standardized prompting and scoring rules for open-ended proofs).

minor comments (1)

A summary table listing task categories, number of problems per category, and evaluation metrics would help readers quickly grasp the platform's breadth and the basis for the aggregated performance figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and robustness of our presentation. We have revised the manuscript to address the concerns about methodological details in the abstract and platform sections. Below we respond point by point.

read point-by-point responses

Referee: [Abstract] Abstract: The headline performance claims (98% on 2026 USAMO, 74% on research-level questions) are stated without any description of evaluation methodology, data sources, problem provenance, answer verification procedures, or contamination checks. These details are load-bearing for the central assertion that frontier models now 'comfortably solve extremely challenging mathematical problems.'

Authors: We agree that the abstract's brevity omits these supporting details, which are central to the claims. The full manuscript describes the sources and protocol in later sections, but we have now revised the abstract to include a concise summary: problems are drawn from official post-2024 competition archives and recent arXiv submissions, verified via automated solvers plus expert review, and screened for contamination through date-based cutoffs and overlap checks. This revision strengthens the abstract without altering its length constraints. revision: yes
Referee: [Platform and protocol sections] Platform description: The manuscript asserts that a 'clear evaluation protocol' is maintained and that new benchmarks are 'regularly designed as model capabilities improve,' yet supplies no concrete mechanisms (e.g., post-cutoff problem generation dates, n-gram or embedding overlap checks against training corpora, or standardized prompting and scoring rules for open-ended proofs).

Authors: We acknowledge that the original text referenced the protocol at a high level without enumerating the concrete mechanisms. In the revised manuscript we have expanded the Platform and Evaluation Protocol sections to specify: all problems use post-cutoff dates (2025 onward for competitions and arXiv papers after model training windows); contamination detection employs n-gram overlap thresholds and embedding cosine similarity against public training corpora; prompting is standardized with fixed chain-of-thought templates and few-shot examples; and scoring combines Lean verifiers for formal proofs with rubric-based expert grading for open-ended work. We also detail the saturation-based process for introducing new benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive platform paper with no derivations, equations, or self-referential predictions

full rationale

The paper is a descriptive account of expanding the MathArena evaluation platform to cover proof-based competitions, research-level arXiv problems, and Lean formal proofs, while maintaining an evaluation protocol and adding new benchmarks. It reports observed performance numbers (e.g., GPT-5.5 at 98% on 2026 USAMO) without any derivation chain, fitted parameters, predictions, or equations. No load-bearing self-citations reduce claims to tautologies; the reference to the original MathArena is contextual background, not a premise that forces the reported results. The concern about benchmark contamination is an external validity issue, not a circular reduction in the paper's own logic. No enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the paper is a systems description of an evaluation platform rather than a theoretical contribution.

pith-pipeline@v0.9.0 · 5551 in / 1065 out tokens · 34846 ms · 2026-05-09T19:15:22.109205+00:00 · methodology

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency
stat.ML 2026-05 unverdicted novelty 7.0

CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
cs.AI 2026-05 unverdicted novelty 6.0

Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
BFLA: Block-Filtered Long-Context Attention Mechanism
eess.SP 2026-05 unverdicted novelty 4.0

BFLA is a two-stage block-filtered sparse prefill attention mechanism that constructs an input-dependent block mask and applies tile-level rescues to skip unimportant KV tiles while preserving exact attention inside r...

Reference graph

Works this paper leans on

110 extracted references · 14 canonical work pages · cited by 5 Pith papers · 3 internal anchors

[6]

Jordan, Joseph E

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett,...

2024
[7]

Ai model & api providers analysis

Artificial Analysis. Ai model & api providers analysis. https://artificialanalysis.ai/,
[8]

Accessed: 2026-04-13

2026
[9]

Benchmarks.https://www.vals.ai/benchmarks, 2026

Vals AI. Benchmarks.https://www.vals.ai/benchmarks, 2026. Accessed: 2026-04-13

2026
[10]

Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovic, and Martin T. Vechev. Matharena: Evaluating llms on uncontaminated math competitions.CoRR, abs/2505.23281,

work page internal anchor Pith review arXiv
[12]

The lean theorem prover (system description)

Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The lean theorem prover (system description). In Amy P. Felty and Aart Middeldorp, editors,Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings, Lecture Notes in Computer Science, p...

work page doi:10.1007/978-3-319-21401-6 2015
[13]

Gpt-5.5 system card

OpenAI. Gpt-5.5 system card. Technical report, OpenAI, April 2026. URL https://openai. com/index/gpt-5-5-system-card/. Accessed: 2026-04-27

2026
[14]

Deepseek-v4: Towards highly efficient million-token context intelligence

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. Technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/ blob/main/DeepSeek_V4.pdf. Released April 24, 2026. Accessed: 2026-04-29

2026
[15]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021. URLhttps://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Measuring mathematical problem solv- ing with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solv- ing with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2...

2021
[17]

URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html

2021
[18]

Omni-math: A universal olympiad level mathematic benchmark for large language models

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Cheng- hao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shang- haoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models. InThe Thirtee...

2025
[19]

Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceed...

2024
[20]

org/CorpusID:273375463

Association for Computational Linguistics, 2024. doi: 10.18653/V1/2024.ACL-LONG

work page doi:10.18653/v1/2024.acl-long 2024
[21]

URLhttps://doi.org/10.18653/v1/2024.acl-long.211

work page doi:10.18653/v1/2024.acl-long.211 2024
[22]

Ugmathbench: A diverse and dynamic benchmark for undergraduate-level mathematical reasoning with large language models

Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, and Can Yang. Ugmathbench: A diverse and dynamic benchmark for undergraduate-level mathematical reasoning with large language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/fo...

2025
[23]

Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P

Jingxuan Fan, Sarah Martinson, Erik Y . Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P. Brenner. Hardmath: A benchmark dataset for challenging problems in applied mathematics. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. UR...

2025
[30]

Mathematical capabilities of chatgpt

Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...

2023
[32]

Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V . Le, and Junehyuk Jung. Towards robust mathematical reasoning. In Christos Christodoulopou...

work page doi:10.18653/v1/2025.emnlp-main.1794 2025
[37]

Chawla, and Xiangliang Zhang

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V . Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,

2025
[38]

URLhttps://openreview.net/forum?id=3GTtZFiajM
[42]

Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition

George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jen- nings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, edi- tors,Advances in ...

2024
[44]

minif2f: a cross-system benchmark for formal olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv

2022
[47]

Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026

Nikil Ravi, Langston Nashold, and Rayan Krishnan. Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026. URL https://www.vals.ai/ benchmarks/proof_bench. Accessed: 2026-04-29

2026
[50]

Aristotle: Imo-level automated theorem proving.arXiv preprint arXiv:2510.01346,

Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and La...

work page arXiv
[56]

Sciencebench: Project benchmarks

Christian Stump. Sciencebench: Project benchmarks. https://math.sciencebench.ai/ benchmarks, March 2026. Published March 23, 2026. Accessed April 8, 2026

2026
[64]

Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J. Bell. Abstentionbench: Reasoning llms fail on unanswerable questions.CoRR, abs/2506.09038, 2025. doi: 10.48550/ ARXIV .2506.09038. URLhttps://doi.org/10.48550/arXiv.2506.09038

work page doi:10.48550/arxiv.2506.09038 2025
[65]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

2024
[66]

URLhttps://openreview.net/forum?id=KUNzEQMWU7
[67]

Charxiv: Charting gaps in realistic chart understanding in multimodal llms

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. In Amir Globersons, Lester Mackey, Danielle Belgrave, An- gela Fan, Ulrich Paquet, Jakub M. Tomczak, and Ch...

2024
[68]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/forum?id=V...

2025
[73]

Project euler, 2025

Project Euler. Project euler, 2025. URLhttps://projecteuler.net/. Accessed: 2025

2025
[75]

American Invitational Mathematics Examination

Mathematical Association of America. American Invitational Mathematics Examination. https://maa.org/maa-invitational-competitions/, 2026. Accessed: 2026-04-09

2026
[76]

Harvard-mit mathematics tournament, 2025

HMMT. Harvard-mit mathematics tournament, 2025. URL https://www.hmmt.org/. Ac- cessed: 2025

2025
[77]

Vision language models are blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, and Hongbin Zha, editors,Computer Vision - ACCV 2024 - 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8-12, 2024, Proceedings, Part V, Lecture Notes in Computer Science, p...

work page doi:10.1007/978-981-96-0917-8 2024
[78]

Kangourou sans Frontières

Kangourou sans Frontières. Kangourou sans Frontières. https://www.aksf.org/, 2026. Accessed: 2026-04-09

2026
[79]

Miklós Schweitzer Memorial Competition

Bolyai János Mathematical Society. Miklós Schweitzer Memorial Competition. https: //www.bolyai.hu/versenyek-schweitzer-miklos-emlekverseny/ , 2026. Accessed: 2026- 04-09

2026
[80]

William Lowell Putnam Mathematical Competition

Mathematical Association of America. William Lowell Putnam Mathematical Competition. https://maa.org/putnam/, 2026. Accessed: 2026-04-09

2026
[81]

International Mathematics Competition for University Students

International Mathematics Competition for University Students. International Mathematics Competition for University Students. https://www.imc-math.org.uk/, 2026. Accessed: 2026-04-09

2026
[83]

comparator, 2026

leanprover. comparator, 2026. URL https://github.com/leanprover/comparator. GitHub repository, retrieved 2026-04-17

2026
[84]

Openai o3 and o4-mini system card

OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, April 2025. URL https://openai.com/index/o3-o4-mini-system-card/. Accessed: 2026-04-09

2025
[86]

Openai o3-mini system card

OpenAI. Openai o3-mini system card. Technical report, OpenAI, January 2025. URL https://openai.com/index/o3-mini-system-card/. Accessed: 2026-04-09

2025
[88]

Update to gpt-5 system card: Gpt-5.2

OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, OpenAI, Decem- ber 2025. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/ . Ac- cessed: 2026-04-09

2025
[89]

Gpt-5.4 thinking system card

OpenAI. Gpt-5.4 thinking system card. Technical report, OpenAI, March 2026. URL https://openai.com/index/gpt-5-4-thinking-system-card/. Accessed: 2026-04-09

2026
[90]

Grok 4.xAI News, July 2025

xAI. Grok 4.xAI News, July 2025. URL https://x.ai/news/grok-4. Accessed: 2026-04-09

2025
[91]

Gemini 3 pro - model card

Google. Gemini 3 pro - model card. Technical report, Google, Decem- ber 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf. Accessed: 2026-04-09

2025
[92]

Gemini 3.1 pro model card

Google. Gemini 3.1 pro model card. Technical report, Google, Febru- ary 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Pro-Model-Card.pdf. Accessed: 2026-04-09

2026
[95]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026
[97]

Marah I Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat S. Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safo...

work page arXiv
[100]

Gemini 2.5 deep think model card

Google. Gemini 2.5 deep think model card. Technical report, Google, Au- gust 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-2-5-Deep-Think-Model-Card.pdf. Accessed: 2026-04-09

2025
[101]

System card: Claude mythos preview

Anthropic. System card: Claude mythos preview. Technical report, Anthropic, April 2026. URL https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841. pdf. Accessed: 2026-04-30

2026
[102]

doi: 10.7717/peerj-cs.103

Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondrej Certík, Sergey B. Kirpichev, Matthew Rocklin, Amit Kumar, Sergiu Ivanov, Jason Keith Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Stepán ...

work page doi:10.7717/peerj-cs.103 2017
[106]

System card: Claude opus 4.6

Anthropic. System card: Claude opus 4.6. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf. Accessed: 2026-04-09

2026
[107]

The lean mathematical library

The mathlib Community. The lean mathematical library. In Jasmin Blanchette and Catalin Hritcu, editors,Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, New Orleans, LA, USA, January 20-21, 2020, pages 367–

2020
[108]

In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs

ACM, 2020. doi: 10.1145/3372885.3373824. URLhttps://doi.org/10.1145/3372885. 3373824

work page doi:10.1145/3372885.3373824 2020
[109]

AXLE: Axiom Lean Engine

Axiom Math. AXLE: Axiom Lean Engine. https://axle.axiommath.ai/, 2026. Accessed: 2026-04-10

2026
[110]

loogle: Mathlib search tool

Joachim Breitner. loogle: Mathlib search tool. https://github.com/nomeata/loogle, 2026. GitHub repository. Accessed: 2026-04-10

2026
[111]

Leanexplore: A search engine for lean 4 declarations.arXiv preprint arXiv:2506.11085,

Justin Asher. Leanexplore: A search engine for lean 4 declarations.CoRR, abs/2506.11085,

work page arXiv
[113]

Gemini 3 flash model card

Google. Gemini 3 flash model card. Technical report, Google, Decem- ber 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Flash-Model-Card.pdf. Accessed: 2026-04-09

2025
[115]

Smith, Edoardo M

Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan S. Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Recursive self-aggregation unlocks deep thinking in large language models.CoRR, abs/2509.26626, 2025. doi: 10.48550/ARXIV .2509.26626. URL https://doi.o...

work page internal anchor Pith review doi:10.48550/arxiv 2025
[116]

Try to prove the following statement: {perturbed_statement}

Roger Jin, Jeffrey Quesnelle, Dakota Mahan, Chen Guang, Ryan Teknium, Jun Park, Ibrakhim Ustelbay, Samuel Kim, Miron Yurkevich, Adilet Zauytkhan, Rinat Amankos, Alex Andreyev, Damir Nurlanov, Abuzer Abuov, and Askar massiveaxe. Nomos. https://github.com/ NousResearch/nomos, 2025. GitHub repository. 18 A Benchmark Details Table 4: Benchmarks included in MA...

2025
[117]

Chain A: [idea] / Chain B: [idea] /

**Checkpoints (7 pts total)** * Break the solution into logically independent checkpoints with **integer** point values. * Allocate **>= 4 pts** to the main idea/critical steps (all combined); **<= 3 pts** to routine work (all combined).,→ * If a specific checkpoint can be achieved in multiple mathematically equivalent ways, mention this. However, only do...
[118]

**Zero-credit items** * List common arguments or observations that **earn 0 points** (e.g., conjectures without proof, especially in geometry, routine restatements of the problem, or dead-ends).,→ * Avoid redundancy with deductions below
[119]

cap at x/7

**Deductions** * Bullet each typical/expected mistake with a **flat** penalty (**1**, **2**, or **"cap at x/7"**). Use caps for major problems, while subtraction should be used for smaller arithmetic errors. Try to anticipate likely errors based on the official solution structure. ,→ ,→ * Target logic gaps, invalid claims, circular reasoning, or contradic...
[120]

**Mathematical validity** of the proof's reasoning and conclusion
[121]

**Problem constraints** (e.g., unique required final value; forbidden tools if stated)
[122]

brilliant

**Advisory mapping to the marking scheme** (checkpoints/deductions), allowing different orders and techniques. **Alternative-approach policy:** - If the proof uses a different but valid method, **map its steps to equivalent rubric checkpoints** (same logical role) and award points accordingly.,→ - Apply zero-credit items/deductions **only when the underly...
[123]

action":

specific error 2, ... </errors> ### INPUT DATA ## Problem Statement ## {problem_statement} ## Marking Scheme ## {guidelines} ## Proof Solution ## {student_answer} Judge Proof Reconciliation {judge_prompt} You are additionally given the following judgments from other judges on the same proof. They have all read the same proof and marking scheme, but they s...
[124]

**Direct derivability** The answer must be derivable *directly and unambiguously* from the abstract alone, without requiring access to the full paper or external references.,→
[125]

**Main contribution** The question must target a *primary theorem, result, or quantitative claim* of the paper, not background material, motivation, or related work.,→
[126]

**Unambiguous and objective** The question must have exactly **one correct answer**, with no dependence on interpretation, conventions, or unstated assumptions.,→
[127]

significant,

**Non-subjective** The question must not involve opinions, qualitative judgments, or vague descriptors (e.g., "significant," "large," "efficient").,→
[128]

Additionally, avoid logical expressions and inequalities

**Answer format constraint** The answer must be **either**: - a single numerical value, or - a pure LaTeX mathematical expression The answer **must not contain any English words**, including within LaTeX (symbols and variables are allowed). Additionally, avoid logical expressions and inequalities. Focus on functions, constants, formulas, or specific mathe...
[129]

**Question type restriction** The question must **not** be: - yes/no - multiple-choice - a request to prove or explain something
[130]

**Machine-verifiable** The answer must be suitable for **rule-based verification**, meaning it can be extracted and compared as a string or parsed LaTeX expression.,→
[131]

in this work

**Self-contained** The question must be understandable *on its own*. - Do **not** reference the paper, authors, or phrases like "in this work." - All notation and quantities used must be explicitly defined in the question
[132]

**No paper references in the answer** The answer must be a standalone mathematical object and must not refer to the paper, its results, or its statements.,→
[133]

**Claim needs to be proven** The authors must say they have actually proven or established the claim in the paper, not just stated it as a conjecture or open problem.,→
[134]

In particular, all variables, notation, and quantities used in the question must be explicitly defined within the question itself

**All context provided** Ensure the question contains all necessary context from the abstract to be answerable. In particular, all variables, notation, and quantities used in the question must be explicitly defined within the question itself. ,→ ,→ It is okay if questions are long, as long as they remain clear and unambiguous. 43
[135]

stated result

**Be careful with bounds** Some papers prove bounds or inequalities. These are acceptable only if the bound is stated to be tight or exact in the abstract, so that there is a unique correct answer.,→ Otherwise, such abstracts should be rejected. --- ## Examples of Unacceptable Questions - A question that is very easy, and clearly not the main contribution...

Showing first 80 references.