RMA: an Agentic System for Research-Level Mathematical Problems

Bo Yuan; Jaemoo Choi; Yongxin Chen; Zelin Zhao

arxiv: 2605.22875 · v1 · pith:YVMOLVOVnew · submitted 2026-05-20 · 💻 cs.AI · cs.LG

RMA: an Agentic System for Research-Level Mathematical Problems

Zelin Zhao , Bo Yuan , Jaemoo Choi , Yongxin Chen This is my paper

Pith reviewed 2026-05-25 06:03 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords agentic frameworkmathematical reasoningresearch-level problemsproof generationmulti-agent workflowiterative refinementliterature groundingautomated reasoning

0 comments

The pith

RMA solves eight out of ten research-level mathematical problems by coordinating agents for analysis, literature search, and iterative proof refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Research Math Agents (RMA), an agentic framework for tackling research-level mathematical problems that demand long-horizon reasoning, literature grounding, and iterative proof refinement. RMA breaks these problems into specialized modules including problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification. These modules are managed by initializer, proposer, and verifier agents through a shared structured memory in a multi-role, multi-round workflow. On the First Proof benchmark of ten problems contributed by expert mathematicians, RMA outperforms baselines such as GPT-5.2R and Aletheia by solving eight problems and generating more logically sound and readable proofs. Ablation studies confirm that the gains result from the combined interaction of the structured modules, iterative refinement, and verifier feedback.

Core claim

RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. This enables solving eight out of ten research problems on the First Proof benchmark with more logically sound and readable proofs than strong baselines.

What carries the argument

The multi-role, multi-round workflow with shared structured memory coordinating specialized modules for analysis, literature search, comparison, knowledge construction, and verification.

If this is right

Performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback rather than any single component.
The framework can address long-horizon reasoning and literature grounding required in research-level mathematics beyond competition problems or formal theorem proving.
RMA produces proofs rated higher in logical soundness and readability by experts compared to baselines including GPT-5.2R and Aletheia.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shared structured memory may support scaling the approach to problems that require synthesizing information across many papers.
Applying the same module decomposition to domains like theoretical computer science could test whether the workflow generalizes beyond the tested benchmark.
Adding formal verification tools to the verifier module might reduce reliance on expert human judgment for soundness.

Load-bearing premise

Expert evaluation of the generated proofs is reliable, unbiased, and consistent across the ten contributed problems, and the First Proof benchmark accurately captures the requirements of genuine research-level mathematics.

What would settle it

Independent expert re-evaluation of the proofs finding that fewer than six of the eight claimed solutions are logically sound and readable, or failure to solve a majority of additional research problems contributed by mathematicians in the same domains.

Figures

Figures reproduced from arXiv: 2605.22875 by Bo Yuan, Jaemoo Choi, Yongxin Chen, Zelin Zhao.

**Figure 1.** Figure 1: A sample math problem in First Proof [16] and solution comparisons in summary. Aletheia [17] has no successful output for this problem. GPT-5.2R from the reasoning team of OpenAI [18] derives a looser bound and exhibits minor issues, such as reference hallucinations, as identified by mathematicians. Ours produces a better bound than GPT-5.2R and passes expert checks. existing auto-research agents [26, 27] … view at source ↗

**Figure 2.** Figure 2: Overview of RMA for automated reasoning on research-level mathematics. (Left) The system is built upon six specialized modules that support problem formalization, literature grounding, knowledge reuse, and disciplined proof construction. (Middle) A multi-agent architecture comprising initializer, proposer, and verifier agents interacts through a shared structured memory, with role-specific read/write permi… view at source ↗

read the original abstract

We present $\textbf{Research Math Agents (RMA)}$, an agentic framework for automated reasoning on research-level mathematical problems. Unlike prior studies centered on competition mathematics or formal theorem proving, RMA targets research-level mathematical problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. We evaluate RMA on the First Proof benchmark, which consists of ten research-level problems contributed by expert mathematicians across diverse domains. Through comprehensive expert evaluation, RMA outperforms strong baselines on the First Proof benchmark, including GPT-5.2R and Aletheia, solving eight out of ten research problems and producing more logically sound and readable proofs. Our comprehensive ablation studies further show that performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than any single component. Our solutions and implementations will be made publicly available upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents Research Math Agents (RMA), a multi-agent framework that decomposes research-level proof construction into modules for problem analysis, literature search, fair comparison, knowledge-bank construction, and verification. These are coordinated via initializer, proposer, and verifier agents operating in a multi-round workflow with shared structured memory. The system is evaluated on the newly introduced First Proof benchmark consisting of ten research-level problems contributed by expert mathematicians; the central claim is that RMA solves eight of the ten problems, outperforms baselines including GPT-5.2R and Aletheia, and produces more logically sound and readable proofs according to expert judgment, with ablations attributing gains to module interaction and iterative verifier feedback.

Significance. If the empirical claims were supported by transparent, reproducible evaluation, the work would be significant for shifting automated reasoning research from contest-style or formally verified problems toward open research mathematics that requires literature grounding and long-horizon iteration. The modular agent design and emphasis on iterative refinement represent a concrete architectural proposal that could be tested in other domains.

major comments (3)

[Abstract and §4] Abstract and §4 (Evaluation): The headline result that RMA solves 8/10 problems and outperforms GPT-5.2R and Aletheia rests entirely on expert judgments of logical soundness and readability, yet the manuscript supplies no description of the evaluation protocol, blinding procedure, rubric, inter-rater agreement statistics, or how the ten contributed problems were confirmed to be genuine open research questions rather than contest-style items. Without these details the central empirical claim cannot be assessed.
[Abstract and §3] Abstract and §3 (Benchmark): The First Proof benchmark is defined and contributed within the paper itself, creating an unaddressed risk of circularity; no external validation, pre-registration, or independent confirmation that the problems meet the stated criteria of research-level mathematics is reported. This directly affects the reliability of the 8/10 success rate.
[Abstract] Abstract: Ablation studies are invoked to attribute performance gains to module interaction and verifier feedback, but the manuscript provides neither the ablation table, the exact configurations tested, nor quantitative deltas, rendering the causal claim about component contributions unverifiable from the given text.

minor comments (1)

[Abstract] The abstract refers to 'comprehensive expert evaluation' and 'comprehensive ablation studies' without cross-references to specific tables or appendices that would allow readers to locate the supporting data.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where greater transparency is needed in the evaluation protocol, benchmark documentation, and ablation results. We address each point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Evaluation): The headline result that RMA solves 8/10 problems and outperforms GPT-5.2R and Aletheia rests entirely on expert judgments of logical soundness and readability, yet the manuscript supplies no description of the evaluation protocol, blinding procedure, rubric, inter-rater agreement statistics, or how the ten contributed problems were confirmed to be genuine open research questions rather than contest-style items. Without these details the central empirical claim cannot be assessed.

Authors: We agree that the evaluation protocol requires explicit documentation. In the revised manuscript we will add a dedicated subsection in §4 that fully describes the expert evaluation procedure. This will include the scoring rubric for logical soundness and readability, the blinding protocol (experts received anonymized proofs without system identifiers), inter-rater agreement statistics, and the vetting process used by contributing mathematicians to confirm the problems represent open research questions rather than contest-style items. The evaluation guidelines will also be provided as supplementary material. revision: yes
Referee: [Abstract and §3] Abstract and §3 (Benchmark): The First Proof benchmark is defined and contributed within the paper itself, creating an unaddressed risk of circularity; no external validation, pre-registration, or independent confirmation that the problems meet the stated criteria of research-level mathematics is reported. This directly affects the reliability of the 8/10 success rate.

Authors: The First Proof benchmark is a new contribution introduced in this paper. Problems were curated in direct consultation with expert mathematicians who confirmed they require literature grounding and long-horizon iteration. Pre-registration was not feasible because the benchmark was developed concurrently with the system. In revision we will expand §3 with additional documentation of the curation criteria and anonymized expert input on problem selection. We will also release the full benchmark publicly upon acceptance to enable independent verification. revision: partial
Referee: [Abstract] Abstract: Ablation studies are invoked to attribute performance gains to module interaction and verifier feedback, but the manuscript provides neither the ablation table, the exact configurations tested, nor quantitative deltas, rendering the causal claim about component contributions unverifiable from the given text.

Authors: We acknowledge that the ablation results were summarized rather than presented in full tabular form. The revised manuscript will include a complete ablation table in §4 listing all tested configurations (e.g., without structured memory, without iterative verifier feedback), the exact parameter settings, and quantitative deltas for both success rate and expert scores. This will allow readers to verify the contribution of each component. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected.

full rationale

The paper presents an empirical agentic framework evaluated on a newly introduced benchmark via expert judgments. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central performance claims (8/10 solved, outperforming baselines) rest on external-style empirical reporting rather than any self-definitional construct, fitted-input-as-prediction, or self-citation load-bearing step that reduces the result to its inputs by construction. No enumerated circularity pattern is exhibited with a quotable reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no visible free parameters, axioms, or invented entities; the system description relies on standard multi-agent concepts without additional postulates.

pith-pipeline@v0.9.0 · 5757 in / 1163 out tokens · 28336 ms · 2026-05-25T06:03:32.361951+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 9 internal anchors

[1]

The logic theory machine-a complex information processing system

A Newell and H Simon. The logic theory machine-a complex information processing system. IRE Transactions on Information Theory, 2(3):61–79, 1956

work page 1956
[2]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

work page 2021
[3]

Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

work page 2024
[4]

Llemma: An Open Language Model For Mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics.arXiv preprint arXiv:2310.10631, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InAutomated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, page 625–635, Berlin, Heidelberg,

work page 2021
[6]

End-to-end differentiable proving.Advances in neural information processing systems, 30, 2017

Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving.Advances in neural information processing systems, 30, 2017

work page 2017
[7]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[8]

Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

work page 2022
[9]

Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

work page 2023
[10]

Ai achieves silver-medal standard solving interna- tional 178 mathematical olympiad problems.DeepMind blog, 179:45, 2024

Team AlphaProof and Team AlphaGeometry. Ai achieves silver-medal standard solving interna- tional 178 mathematical olympiad problems.DeepMind blog, 179:45, 2024

work page 2024
[11]

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

work page 2025
[12]

Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.https: //arxiv.org/abs/2405.14333, 2024

Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.arXiv preprint arXiv:2405.14333, 2024

work page arXiv 2024
[13]

Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis H...

work page 2026
[14]

autoresearch.https://github.com/karpathy/autoresearch

Andrej Karpathy. autoresearch.https://github.com/karpathy/autoresearch

work page
[15]

The agentic researcher: A practical guide to ai-assisted research in mathematics and machine learning, 2026

Max Zimmer, Nico Pelleriti, Christophe Roux, and Sebastian Pokutta. The agentic researcher: A practical guide to ai-assisted research in mathematics and machine learning, 2026

work page 2026
[16]

First proof.arXiv preprint arXiv:2602.05192, 2026

Mohammed Abouzaid, Andrew J Blumberg, Martin Hairer, Joe Kileel, Tamara G Kolda, Paul D Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, et al. First proof.arXiv preprint arXiv:2602.05192, 2026. 10

work page arXiv 2026
[17]

Lee, Garrett Bingham, Trieu H

Tony Feng, Junehyuk Jung, Sang hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V . Le, and Thang Luong. Aletheia tackles firstproof autonomously, 2026

work page 2026
[18]

First proof submissions

OpenAI. First proof submissions. https://openai.com/index/ first-proof-submissions/, 2025. Accessed: 2026-04-18

work page 2025
[19]

Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28489–28503, 2025

work page 2025
[20]

Agentic strategy design for math proofs

Dietmar Wolz. Agentic strategy design for math proofs. https://althofer.de/agentic_ strategy_design_for_math_proofs.pdf, 2026. Accessed February 12, 2026

work page 2026
[21]

Aflow: Automating agentic workflow generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[22]

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Exploring advanced llm multi-agent systems based on blackboard architecture.arXiv preprint arXiv:2507.01701, 2025

Bochen Han and Songmao Zhang. Exploring advanced llm multi-agent systems based on blackboard architecture.arXiv preprint arXiv:2507.01701, 2025

work page arXiv 2025
[24]

Achieving> 97% on gsm8k: Deeply understanding the problems makes llms better solvers for math word problems.Frontiers of Computer Science, 20(1):1–3, 2026

Qihuang Zhong, Kang Wang, Ziyang Xu, Liang Ding, Juhua Liu, and Bo Du. Achieving> 97% on gsm8k: Deeply understanding the problems makes llms better solvers for math word problems.Frontiers of Computer Science, 20(1):1–3, 2026

work page 2026
[25]

Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

work page 2024
[26]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/, 2025. Research leads: Isa Fulford, Zhiqing Sun

work page 2025
[27]

Gemini deep research

Google DeepMind. Gemini deep research. https://gemini.google/overview/ deep-research/, 2025. AI research assistant system

work page 2025
[28]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

work page 2023
[29]

Evaluating mathematical reasoning of large language models: A focus on error identification and correction

Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11316–11360, 2024

work page 2024
[30]

Mathematical capabilities of chatgpt

Simon Frieder, Luca Pinchetti, Chevalier Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Advances in neural information processing systems, 36:27699–27744, 2023

work page 2023
[31]

Taylor, Frank H

Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, and Dirk Englund. Ax-prover: A deep reasoning agentic framework for theorem proving in mathematics and quantum physics, 2025

work page 2025
[32]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? InProceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 2080–2094, 2021

work page 2021
[34]

A diverse corpus for evaluating and developing english math word problem solvers

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers. InProceedings of the 58th annual meeting of the Association for Computational Linguistics, pages 975–984, 2020

work page 2020
[35]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

work page 2024
[37]

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai.arXiv preprint arXiv:2411.04872, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Learning Explanatory Rules from Noisy Data

Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data.CoRR, abs/1711.04574, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[40]

Lemmahead: Rag assisted proof generation using large language models.ArXiv, abs/2501.15797, 2025

Tianbo Yang, Mingqi Yang, Hongyi Zhao, and Tianshuo Yang. Lemmahead: Rag assisted proof generation using large language models.ArXiv, abs/2501.15797, 2025

work page arXiv 2025
[41]

Real-prover: Retrieval augmented lean prover for mathematical reasoning.arXiv preprint arXiv:2505.20613, 2025

Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, et al. Real-prover: Retrieval augmented lean prover for mathematical reasoning.arXiv preprint arXiv:2505.20613, 2025

work page arXiv 2025
[42]

Proof artifact co-training for theorem proving with language models

Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. InInternational Conference on Learning Representations, 2022

work page 2022
[43]

Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022

work page 2022
[44]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs

Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[45]

Mathematical discoveries from program search with large language models

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

work page 2024
[46]

Mario: Math reasoning with code interpreter output-a reproducible pipeline

Minpeng Liao, Chengxi Li, Wei Luo, Wu Jing, and Kai Fan. Mario: Math reasoning with code interpreter output-a reproducible pipeline. InFindings of the Association for Computational Linguistics: ACL 2024, pages 905–924, 2024

work page 2024
[47]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 12

work page 2022
[48]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[49]

Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

work page 2023
[50]

Program of thoughts prompt- ing: Disentangling computation from reasoning for numerical reasoning tasks.Transactions on Machine Learning Research, 2023

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompt- ing: Disentangling computation from reasoning for numerical reasoning tasks.Transactions on Machine Learning Research, 2023

work page 2023
[51]

Pal: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. InInternational conference on machine learning, pages 10764–10799. PMLR, 2023

work page 2023
[52]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

work page 2023
[53]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

work page 2022
[54]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

work page 2024
[55]

Openhands: An open platform for ai software developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[56]

Step-wise formal verification for llm-based mathematical problem solving.arXiv preprint arXiv:2505.20869, 2025

Kuo Zhou and Lu Zhang. Step-wise formal verification for llm-based mathematical problem solving.arXiv preprint arXiv:2505.20869, 2025

work page arXiv 2025
[57]

Codex | ai coding partner from openai

OpenAI. Codex | ai coding partner from openai. https://openai.com/codex/. Accessed: 2026-04-26

work page 2026
[58]

Claude code.https://code.claude.com/docs/en/verview

Anthropic. Claude code.https://code.claude.com/docs/en/verview

work page
[59]

Build, debug & deploy with ai.https://geminicli.com/

Google. Build, debug & deploy with ai.https://geminicli.com/. Accessed: 2026-04-26

work page 2026
[60]

arxiv e-print archive.https://arxiv.org

Cornell University. arxiv e-print archive.https://arxiv.org. Accessed: 2026

work page 2026
[61]

Mathscinet

American Mathematical Society. Mathscinet. https://mathscinet.ams.org. Accessed: 2026

work page 2026
[62]

An overview of zbmath open digital library

Madhurima Deb, Isabel Beckenbach, Matteo Petrera, Dariush Ehsani, Marcel Fuhrmann, Yun Hao, Olaf Teschke, and Moritz Schubotz. An overview of zbmath open digital library. In Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries, JCDL ’24, page 1–5. ACM, December 2024

work page 2024
[63]

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory.arXiv preprint arXiv:2509.25140, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[64]

Elsevier, 2004

Clive Dym.Principles of mathematical modeling. Elsevier, 2004

work page 2004
[65]

Cambridge university press, 2004

Stephen Boyd and Lieven Vandenberghe.Convex optimization. Cambridge university press, 2004

work page 2004
[66]

Cambridge university press, 2017

Michael Mitzenmacher and Eli Upfal.Probability and computing: Randomization and proba- bilistic techniques in algorithms and data analysis. Cambridge university press, 2017. 13

work page 2017
[67]

Towards mitigating llm hallucination via self reflection

Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating llm hallucination via self reflection. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, 2023

work page 2023
[68]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. 14 Table 4: The 10 problems of the First Proof benchmark. # Area T...

work page 2022

[1] [1]

The logic theory machine-a complex information processing system

A Newell and H Simon. The logic theory machine-a complex information processing system. IRE Transactions on Information Theory, 2(3):61–79, 1956

work page 1956

[2] [2]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

work page 2021

[3] [3]

Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

work page 2024

[4] [4]

Llemma: An Open Language Model For Mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics.arXiv preprint arXiv:2310.10631, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InAutomated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, page 625–635, Berlin, Heidelberg,

work page 2021

[6] [6]

End-to-end differentiable proving.Advances in neural information processing systems, 30, 2017

Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving.Advances in neural information processing systems, 30, 2017

work page 2017

[7] [7]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[8] [8]

Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

work page 2022

[9] [9]

Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

work page 2023

[10] [10]

Ai achieves silver-medal standard solving interna- tional 178 mathematical olympiad problems.DeepMind blog, 179:45, 2024

Team AlphaProof and Team AlphaGeometry. Ai achieves silver-medal standard solving interna- tional 178 mathematical olympiad problems.DeepMind blog, 179:45, 2024

work page 2024

[11] [11]

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

work page 2025

[12] [12]

Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.https: //arxiv.org/abs/2405.14333, 2024

Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.arXiv preprint arXiv:2405.14333, 2024

work page arXiv 2024

[13] [13]

Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis H...

work page 2026

[14] [14]

autoresearch.https://github.com/karpathy/autoresearch

Andrej Karpathy. autoresearch.https://github.com/karpathy/autoresearch

work page

[15] [15]

The agentic researcher: A practical guide to ai-assisted research in mathematics and machine learning, 2026

Max Zimmer, Nico Pelleriti, Christophe Roux, and Sebastian Pokutta. The agentic researcher: A practical guide to ai-assisted research in mathematics and machine learning, 2026

work page 2026

[16] [16]

First proof.arXiv preprint arXiv:2602.05192, 2026

Mohammed Abouzaid, Andrew J Blumberg, Martin Hairer, Joe Kileel, Tamara G Kolda, Paul D Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, et al. First proof.arXiv preprint arXiv:2602.05192, 2026. 10

work page arXiv 2026

[17] [17]

Lee, Garrett Bingham, Trieu H

Tony Feng, Junehyuk Jung, Sang hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V . Le, and Thang Luong. Aletheia tackles firstproof autonomously, 2026

work page 2026

[18] [18]

First proof submissions

OpenAI. First proof submissions. https://openai.com/index/ first-proof-submissions/, 2025. Accessed: 2026-04-18

work page 2025

[19] [19]

Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28489–28503, 2025

work page 2025

[20] [20]

Agentic strategy design for math proofs

Dietmar Wolz. Agentic strategy design for math proofs. https://althofer.de/agentic_ strategy_design_for_math_proofs.pdf, 2026. Accessed February 12, 2026

work page 2026

[21] [21]

Aflow: Automating agentic workflow generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[22] [22]

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Exploring advanced llm multi-agent systems based on blackboard architecture.arXiv preprint arXiv:2507.01701, 2025

Bochen Han and Songmao Zhang. Exploring advanced llm multi-agent systems based on blackboard architecture.arXiv preprint arXiv:2507.01701, 2025

work page arXiv 2025

[24] [24]

Achieving> 97% on gsm8k: Deeply understanding the problems makes llms better solvers for math word problems.Frontiers of Computer Science, 20(1):1–3, 2026

Qihuang Zhong, Kang Wang, Ziyang Xu, Liang Ding, Juhua Liu, and Bo Du. Achieving> 97% on gsm8k: Deeply understanding the problems makes llms better solvers for math word problems.Frontiers of Computer Science, 20(1):1–3, 2026

work page 2026

[25] [25]

Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

work page 2024

[26] [26]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/, 2025. Research leads: Isa Fulford, Zhiqing Sun

work page 2025

[27] [27]

Gemini deep research

Google DeepMind. Gemini deep research. https://gemini.google/overview/ deep-research/, 2025. AI research assistant system

work page 2025

[28] [28]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

work page 2023

[29] [29]

Evaluating mathematical reasoning of large language models: A focus on error identification and correction

Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11316–11360, 2024

work page 2024

[30] [30]

Mathematical capabilities of chatgpt

Simon Frieder, Luca Pinchetti, Chevalier Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Advances in neural information processing systems, 36:27699–27744, 2023

work page 2023

[31] [31]

Taylor, Frank H

Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, and Dirk Englund. Ax-prover: A deep reasoning agentic framework for theorem proving in mathematics and quantum physics, 2025

work page 2025

[32] [32]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021

[33] [33]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? InProceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 2080–2094, 2021

work page 2021

[34] [34]

A diverse corpus for evaluating and developing english math word problem solvers

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers. InProceedings of the 58th annual meeting of the Association for Computational Linguistics, pages 975–984, 2020

work page 2020

[35] [35]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

work page 2024

[37] [37]

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai.arXiv preprint arXiv:2411.04872, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Learning Explanatory Rules from Noisy Data

Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data.CoRR, abs/1711.04574, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[40] [40]

Lemmahead: Rag assisted proof generation using large language models.ArXiv, abs/2501.15797, 2025

Tianbo Yang, Mingqi Yang, Hongyi Zhao, and Tianshuo Yang. Lemmahead: Rag assisted proof generation using large language models.ArXiv, abs/2501.15797, 2025

work page arXiv 2025

[41] [41]

Real-prover: Retrieval augmented lean prover for mathematical reasoning.arXiv preprint arXiv:2505.20613, 2025

Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, et al. Real-prover: Retrieval augmented lean prover for mathematical reasoning.arXiv preprint arXiv:2505.20613, 2025

work page arXiv 2025

[42] [42]

Proof artifact co-training for theorem proving with language models

Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W Ayers, and Stanislas Polu. Proof artifact co-training for theorem proving with language models. InInternational Conference on Learning Representations, 2022

work page 2022

[43] [43]

Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022

work page 2022

[44] [44]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs

Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023

[45] [45]

Mathematical discoveries from program search with large language models

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

work page 2024

[46] [46]

Mario: Math reasoning with code interpreter output-a reproducible pipeline

Minpeng Liao, Chengxi Li, Wei Luo, Wu Jing, and Kai Fan. Mario: Math reasoning with code interpreter output-a reproducible pipeline. InFindings of the Association for Computational Linguistics: ACL 2024, pages 905–924, 2024

work page 2024

[47] [47]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 12

work page 2022

[48] [48]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023

[49] [49]

Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

work page 2023

[50] [50]

Program of thoughts prompt- ing: Disentangling computation from reasoning for numerical reasoning tasks.Transactions on Machine Learning Research, 2023

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompt- ing: Disentangling computation from reasoning for numerical reasoning tasks.Transactions on Machine Learning Research, 2023

work page 2023

[51] [51]

Pal: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. InInternational conference on machine learning, pages 10764–10799. PMLR, 2023

work page 2023

[52] [52]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539– 68551, 2023

work page 2023

[53] [53]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

work page 2022

[54] [54]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

work page 2024

[55] [55]

Openhands: An open platform for ai software developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[56] [56]

Step-wise formal verification for llm-based mathematical problem solving.arXiv preprint arXiv:2505.20869, 2025

Kuo Zhou and Lu Zhang. Step-wise formal verification for llm-based mathematical problem solving.arXiv preprint arXiv:2505.20869, 2025

work page arXiv 2025

[57] [57]

Codex | ai coding partner from openai

OpenAI. Codex | ai coding partner from openai. https://openai.com/codex/. Accessed: 2026-04-26

work page 2026

[58] [58]

Claude code.https://code.claude.com/docs/en/verview

Anthropic. Claude code.https://code.claude.com/docs/en/verview

work page

[59] [59]

Build, debug & deploy with ai.https://geminicli.com/

Google. Build, debug & deploy with ai.https://geminicli.com/. Accessed: 2026-04-26

work page 2026

[60] [60]

arxiv e-print archive.https://arxiv.org

Cornell University. arxiv e-print archive.https://arxiv.org. Accessed: 2026

work page 2026

[61] [61]

Mathscinet

American Mathematical Society. Mathscinet. https://mathscinet.ams.org. Accessed: 2026

work page 2026

[62] [62]

An overview of zbmath open digital library

Madhurima Deb, Isabel Beckenbach, Matteo Petrera, Dariush Ehsani, Marcel Fuhrmann, Yun Hao, Olaf Teschke, and Moritz Schubotz. An overview of zbmath open digital library. In Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries, JCDL ’24, page 1–5. ACM, December 2024

work page 2024

[63] [63]

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory.arXiv preprint arXiv:2509.25140, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[64] [64]

Elsevier, 2004

Clive Dym.Principles of mathematical modeling. Elsevier, 2004

work page 2004

[65] [65]

Cambridge university press, 2004

Stephen Boyd and Lieven Vandenberghe.Convex optimization. Cambridge university press, 2004

work page 2004

[66] [66]

Cambridge university press, 2017

Michael Mitzenmacher and Eli Upfal.Probability and computing: Randomization and proba- bilistic techniques in algorithms and data analysis. Cambridge university press, 2017. 13

work page 2017

[67] [67]

Towards mitigating llm hallucination via self reflection

Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating llm hallucination via self reflection. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, 2023

work page 2023

[68] [68]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. 14 Table 4: The 10 problems of the First Proof benchmark. # Area T...

work page 2022