Recognition: unknown
Bolzano: Case Studies in LLM-Assisted Mathematical Research
Pith reviewed 2026-05-10 06:58 UTC · model grok-4.3
The pith
An LLM multi-agent system generates five essentially autonomous publishable results on math and theoretical computer science problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The system produces new results on eight problems by orchestrating rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base carried across rounds. When classified by significance and autonomy, six results reach publishable research level and five were produced essentially autonomously by the system itself.
What carries the argument
A multi-agent LLM system that coordinates parallel prover agents with a verifier agent and maintains a persistent knowledge base across interaction rounds.
If this is right
- LLMs can reach publishable standards on selected problems in mathematics and theoretical computer science.
- Multi-agent prover-verifier loops with shared memory support extended autonomous solving sessions.
- Five of eight cases required only minimal human guidance after initial problem statements.
- The same architecture applies across both pure mathematics and theoretical computer science questions.
- Such systems can complement existing human-led research workflows.
Where Pith is reading between the lines
- Similar agent loops with persistent state could be tested on open problems that require chaining many steps.
- If the autonomy holds under scrutiny, the approach points toward LLMs handling larger portions of research pipelines.
- The pattern of prover and verifier agents may transfer to other verification-heavy domains such as formal methods or experimental design.
Load-bearing premise
The generated results are correct and novel and the claimed level of autonomy reflects the actual process without undisclosed human intervention.
What would settle it
Independent verification that finds an error in one of the eight claimed results or documentation showing more human input than stated would falsify the autonomy and publishability claims.
read the original abstract
We report new results on eight problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open-source multi-agent LLM system. Bolzano orchestrates rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base that is carried across rounds. Classified using the significance-autonomy taxonomy of Feng et al., six of the eight results reach the level of publishable research, and five of the eight were produced essentially autonomously by Bolzano. Our results provide evidence that LLMs can contribute meaningfully to mathematical research, complementing recent reports by Bubeck et al., Woodruff et al., and others.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Bolzano, an open-source multi-agent LLM system that orchestrates prover and verifier agents with a persistent knowledge base, and reports eight case studies in mathematics and theoretical computer science. Using the Feng et al. significance-autonomy taxonomy, it claims that six of the eight results reach publishable-research level and five were produced essentially autonomously by Bolzano, providing evidence that LLMs can contribute meaningfully to mathematical research.
Significance. If the correctness and autonomy claims hold, the work supplies concrete, open-source case studies that complement reports by Bubeck et al. and Woodruff et al., demonstrating a practical multi-agent workflow with persistent state. The open-source release of the system itself is a strength that enables reproducibility and further experimentation.
major comments (3)
- [Case studies (results classification)] The central claim that five results are 'essentially autonomous' (per Feng et al. taxonomy) is load-bearing yet unsupported by full agent-interaction logs, problem-choice records, or blinded external review; internal classification by the authors risks circularity in assessing autonomy.
- [Results and evaluation] No independent verification (formal proof checking, external expert review, or journal submission) of the six 'publishable' results is provided; given known risks of subtle errors in LLM-generated mathematics, this undermines the publishability claim.
- [Bolzano system description and case studies] The paper does not report the exact prompting strategies, iteration counts, or human interventions in problem selection and refinement for the 'autonomous' cases, making it impossible to assess whether the autonomy threshold was met without undisclosed assistance.
minor comments (2)
- [Abstract] The abstract would benefit from one-sentence descriptions of the mathematical domains or specific problems addressed in the eight case studies.
- [Introduction] Ensure all references to the Feng et al. taxonomy include the full citation and a brief summary of the relevant categories used for classification.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important areas where additional evidence and transparency can strengthen the manuscript. We address each major comment below and commit to a major revision that incorporates more detailed logs, verification steps, and system usage information while preserving the core contribution of the open-source Bolzano system and its case studies.
read point-by-point responses
-
Referee: [Case studies (results classification)] The central claim that five results are 'essentially autonomous' (per Feng et al. taxonomy) is load-bearing yet unsupported by full agent-interaction logs, problem-choice records, or blinded external review; internal classification by the authors risks circularity in assessing autonomy.
Authors: We agree that the current manuscript provides insufficient raw material for independent assessment of the autonomy classification. In the revision we will add (or link to) the complete agent-interaction traces and problem-choice records for the five cases classified as essentially autonomous. We will also include an explicit appendix that maps each case to the precise Feng et al. criteria, documenting the exact points at which human input occurred (limited to initial problem statement and system launch for those five cases). While a fully blinded external review cannot be performed retroactively for this submission, the added traces will allow any reader to apply the taxonomy themselves and thereby reduce the risk of circularity. revision: yes
-
Referee: [Results and evaluation] No independent verification (formal proof checking, external expert review, or journal submission) of the six 'publishable' results is provided; given known risks of subtle errors in LLM-generated mathematics, this undermines the publishability claim.
Authors: We accept that the absence of external verification weakens the strength of the 'publishable-research level' claim. The revised manuscript will (i) state more clearly that the six results are offered as case studies of the system rather than as independently certified theorems, (ii) add formal verification output (where the statements admit it) using existing theorem provers, and (iii) report the current submission status of each result to journals or conferences. We will also expand the discussion of the verifier agent's role and the internal consistency checks performed. These additions will not eliminate all risk of subtle errors but will make the evidential basis more transparent. revision: partial
-
Referee: [Bolzano system description and case studies] The paper does not report the exact prompting strategies, iteration counts, or human interventions in problem selection and refinement for the 'autonomous' cases, making it impossible to assess whether the autonomy threshold was met without undisclosed assistance.
Authors: We will revise the system-description and case-study sections to include the precise prompting templates (both prover and verifier), the average and range of iteration counts per case, and a tabulated record of every human intervention. For the five autonomous cases the table will show that interventions were confined to launching the system and, in two instances, restarting after an infrastructure failure; no mathematical content was supplied or edited by humans after the initial problem statement. This level of detail will allow readers to judge whether the autonomy threshold was met. revision: yes
Circularity Check
No circularity: empirical case studies with external taxonomy
full rationale
The paper reports outcomes from an LLM system on eight mathematical problems without presenting any mathematical derivation, model, or first-principles chain. Results are classified using the external significance-autonomy taxonomy of Feng et al. (distinct authors) and compared to independent reports by Bubeck et al. and Woodruff et al. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear. Claims rest on empirical observation and external classification rather than internal reduction to inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Blumberg, Martin Hairer, Joe Kileel, Tamara G
Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First Proof , 2026
2026
-
[2]
N. Alon, O. Goldreich, J. H stad, and R. Peralta. Simple constructions of almost k -wise independent random variables. In 31st A nnual S ymposium on F oundations of C omputer S cience, V ol.\ I , II ( S t.\ L ouis, MO , 1990) , pages 544--553. IEEE Comput. Soc. Press, Los Alamitos, CA, 1990
1990
-
[3]
Ax-Prover : A deep reasoning agentic framework for theorem proving in mathematics and quantum physics, 2025
Benjamin Breen et al. Ax-Prover : A deep reasoning agentic framework for theorem proving in mathematics and quantum physics, 2025
2025
-
[4]
S. Barman. Approximating N ash equilibria and dense subgraphs via an approximate version of C arath\'eodory's theorem. SIAM J. Comput. , 47(3):960--981, 2018
2018
-
[5]
Babichenko, S
Y. Babichenko, S. Barman, and R. Peretz. Simple approximate equilibria in large games. In Proceedings of the Fifteenth ACM Conference on Economics and Computation , EC '14, page 753–770, New York, NY, USA, 2014. Association for Computing Machinery
2014
-
[6]
Spears, Derya Unutmaz, Kevin Weil, Steven Yin, and Nikita Zhivotovskiy
S\' e bastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, and Nikita Zhivotovskiy. Early science acceleration experiments with GPT -5, 2025
2025
-
[7]
Efficient polynomial commitment schemes for multiple points and polynomials
Dan Boneh, Justin Drake, Ben Fisch, and Ariel Gabizon. Efficient polynomial commitment schemes for multiple points and polynomials. IACR Cryptol. ePrint Arch. , page 81, 2020
2020
-
[8]
Periodicity and decidability of tilings of Z ^2
Siddhartha Bhattacharya. Periodicity and decidability of tilings of Z ^2 . American Journal of Mathematics , 142(1):255--266, 2020
2020
-
[9]
Bessy, F
S. Bessy, F. Havet, and E. Birmel\' e . Arc-chromatic number of digraphs in which every vertex has bounded outdegree or bounded indegree. J. Graph Theory , 53(4):315--332, December 2006
2006
-
[10]
CHOPIN : Optimal pairing-based multilinear polynomial commitments from bivariate KZG
Juraj Belohorec, Pavel Hub \'a c ek, Aleksi Kalsta, and Krist \'y na Ma s kov \'a . CHOPIN : Optimal pairing-based multilinear polynomial commitments from bivariate KZG . IACR Cryptol. ePrint Arch. , page 480, 2026. https://eprint.iacr.org/archive/2026/480/20260308:102759
2026
-
[11]
Buss and Alan S
Samuel R. Buss and Alan S. Johnson. Propositional proofs and reductions between NP search problems. Annals of Pure and Applied Logic , 163(9):1163--1182, 2012
2012
-
[12]
https://bolzano.app/heap-equivalence, 2025
Bolzano transcript: Heap equivalence. https://bolzano.app/heap-equivalence, 2025
2025
-
[13]
Bolzano Team . Bolzano. https://bolzano.app/, 2025
2025
-
[14]
https://bolzano.app/cce-supports, 2026
Bolzano transcript: CCE supports. https://bolzano.app/cce-supports, 2026
2026
-
[15]
https://bolzano.app/kkos-optimization, 2026
Bolzano transcript: KKOS optimization. https://bolzano.app/kkos-optimization, 2026
2026
-
[16]
https://bolzano.app/kzg-batching, 2026
Bolzano transcript: KZG batching. https://bolzano.app/kzg-batching, 2026
2026
-
[17]
https://bolzano.app/function-preimage-partitioning, 2026
Bolzano transcript: Function-preimage partitioning. https://bolzano.app/function-preimage-partitioning, 2026
2026
-
[18]
https://bolzano.app/pwpp-separation, 2026
Bolzano transcript: PWPP separation. https://bolzano.app/pwpp-separation, 2026
2026
-
[19]
https://bolzano.app/multi-slope-tilings, 2026
Bolzano transcript: Multi-slope tilings. https://bolzano.app/multi-slope-tilings, 2026
2026
-
[20]
https://bolzano.app/problem/ade3d456-8788-4d11-b55f-926644ddbc42/files?file_id=d056ac12-6ef7-496a-a790-eba9105816bb, 2026
Bolzano transcript: Interleaving and W ilber's first bound. https://bolzano.app/problem/ade3d456-8788-4d11-b55f-926644ddbc42/files?file_id=d056ac12-6ef7-496a-a790-eba9105816bb, 2026
2026
-
[21]
Balko and T
M. Balko and T. C \' i z ek. Approximating nash equilibria in sparse games. In preparation, 2026
2026
-
[22]
Settling the complexity of computing two-player N ash equilibria
Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player N ash equilibria. J. ACM , 56(3):Art. 14, 57, 2009
2009
-
[23]
The Open Proof Corpus : A large-scale study of LLM -generated mathematical proofs, 2025
Jasper Dekoninck et al. The Open Proof Corpus : A large-scale study of LLM -generated mathematical proofs, 2025
2025
-
[24]
Translational tilings by axes-parallel polygonal sets
Jaume de Dios Pont , Jan Greb\' i k, Rachel Greenfeld, and Jos\' e Madrid. Translational tilings by axes-parallel polygonal sets. Work in progress, 2026
2026
-
[25]
Goldberg, and Christos H
Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a N ash equilibrium. SIAM J. Comput. , 39(1):195--259, 2009
2009
-
[26]
Demaine, Dion Harmon, John Iacono, and Mihai P a tra s cu
Erik D. Demaine, Dion Harmon, John Iacono, and Mihai P a tra s cu. Dynamic optimality---almost. SIAM J. Comput. , 37(1):240--251, 2007
2007
-
[27]
Tenenbaum, and Igor Mordatch
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023
2023
-
[28]
A priority queue with the working-set property
Amr Elmasry. A priority queue with the working-set property. Int. J. Found. Comput. Sci. , 17(6):1455--1466, 2006
2006
-
[29]
Semi-autonomous mathematics discovery with G emini: A case study on the Erd o s problems, 2026
Tony Feng et al. Semi-autonomous mathematics discovery with G emini: A case study on the Erd o s problems, 2026
2026
-
[30]
Towards autonomous mathematics research, 2026
Tony Feng et al. Towards autonomous mathematics research, 2026
2026
-
[31]
Opinion diffusion and campaigning on society graphs
Piotr Faliszewski, Rica Gonen, Martin Kouteck \' y , and Nimrod Talmon. Opinion diffusion and campaigning on society graphs. J. Log. Comput. , 32(6):1162--1194, 2022
2022
-
[32]
Black-box PPP is not T uring-closed
Noah Fleming, Stefan Grosser, Toniann Pitassi, and Robert Robere. Black-box PPP is not T uring-closed. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC) , pages 1405--1414, 2024
2024
-
[33]
Aletheia tackles FirstProof autonomously, 2026
Tony Feng, Junehyuk Jung, Sang hyun Kim, Carlo Pagano, et al. Aletheia tackles FirstProof autonomously, 2026
2026
-
[34]
Right is not enough: The pitfalls of outcome supervision in training LLMs for math reasoning, 2025
Jiaxing Guo et al. Right is not enough: The pitfalls of outcome supervision in training LLMs for math reasoning, 2025
2025
-
[35]
Mathematical exploration and discovery at scale, 2025
Bogdan Georgiev, Javier G \'o mez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale, 2025
2025
-
[36]
The S tructure of T ranslational T ilings in Z ^d
Rachel Greenfeld and Terence Tao. The S tructure of T ranslational T ilings in Z ^d . Discrete Analysis , 2021
2021
-
[37]
A counterexample to the periodic tiling conjecture
Rachel Greenfeld and Terence Tao. A counterexample to the periodic tiling conjecture. Annals of Mathematics , 200(1):301--363, 2024
2024
-
[38]
Undecidability of translational monotilings
Rachel Greenfeld and Terence Tao. Undecidability of translational monotilings. J. Eur. Math. Soc. , 2025
2025
-
[39]
Universal optimality of D ijkstra via beyond-worst-case heaps, 2023
Bernhard Haeupler, Richard Hlad\' i k, V\' a clav Rozho n , Robert Tarjan, and Jakub T e tek. Universal optimality of D ijkstra via beyond-worst-case heaps, 2023
2023
-
[40]
Horv \'a th, et al
Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z. Horv \'a th, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature , 2025
2025
-
[41]
Black-box PWPP is not T uring closed
Pavel Hub\' a c ek. Black-box PWPP is not T uring closed. CoRR , abs/2602.23809, 2026. https://doi.org/10.48550/arXiv.2602.23809
-
[42]
Hao Huang and Lin F. Yang. Winning gold at IMO 2025 with a model-agnostic verification-and-refinement pipeline, 2025
2025
-
[43]
Improved upper bounds for pairing heaps
John Iacono. Improved upper bounds for pairing heaps. In Scandinavian Workshop on Algorithm Theory , pages 32--45. Springer, 2000
2000
-
[44]
Integer factoring and modular square roots
Emil Je r \'a bek. Integer factoring and modular square roots. Journal of Computer and System Sciences , 82(2):380--394, 2016
2016
-
[45]
Rigidity of planar tilings
Richard Kenyon. Rigidity of planar tilings. Inventiones Mathematicae , 107(3):637--651, 1992
1992
-
[46]
Kleinberg, Sigal Oren, and Aleksandrs Slivkins
David Kempe, Jon M. Kleinberg, Sigal Oren, and Aleksandrs Slivkins. Selection and influence in cultural dynamics. Netw. Sci. , 4(1):1--27, 2016
2016
-
[47]
Open problems
Karel Kr \'a l. Open problems. https://kam.mff.cuni.cz/ kamak/static/problems/2020.pdf, 2020. KAMAK Problem Solving Workshop
2020
-
[48]
Zaverucha, and Ian Goldberg
Aniket Kate, Gregory M. Zaverucha, and Ian Goldberg. Constant-size commitments to polynomials and their applications. In Masayuki Abe, editor, Advances in Cryptology - ASIACRYPT 2010 - 16th International Conference on the Theory and Application of Cryptology and Information Security, Singapore, December 5-9, 2010. Proceedings , volume 6477 of Lecture Note...
2010
-
[49]
R. J. Lipton, E. Markakis, and A. Mehta. Playing large games using simple strategies. In Proceedings of the 4th ACM Conference on Electronic Commerce , EC '03, pages 36--41, New York, NY, USA, 2003. ACM
2003
-
[50]
Aleph Prover , 2025
Logical Intelligence . Aleph Prover , 2025. https://logicalintelligence.com/aleph-prover.html
2025
-
[51]
On knowledge-soundness of Plonk in ROM from falsifiable assumptions
Helger Lipmaa, Roberto Parisella, and Janno Siim. On knowledge-soundness of Plonk in ROM from falsifiable assumptions. In Yael Tauman Kalai and Seny F. Kamara, editors, Advances in Cryptology - CRYPTO 2025 - 45th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 17-21, 2025, Proceedings, Part VII , volume 16006 of Lecture Notes in...
2025
-
[52]
Lagarias and Yang Wang
Jeffrey C. Lagarias and Yang Wang. Tiling the line with translates of one tile. Inventiones Mathematicae , 124(1--3):341--365, 1996
1996
-
[53]
Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 , 2023
work page internal anchor Pith review arXiv 2023
-
[54]
AlphaEvolve : A coding agent for scientific and algorithmic discovery, 2025
Alexander Novikov et al. AlphaEvolve : A coding agent for scientific and algorithmic discovery, 2025
2025
-
[55]
Our First Proof submissions
OpenAI . Our First Proof submissions. https://openai.com/index/first-proof-submissions/, 2026
2026
-
[56]
Bowman, and Shi Feng
Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations, 2024
2024
-
[57]
On infinite sets with no 3 on a line, 2026
Moe Putterman, Mehtaab Sawhney, and Gregory Valiant. On infinite sets with no 3 on a line, 2026
2026
-
[58]
Pawan Kumar, Emilien Dupont, Francisco J
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature , 625(7995):468--475, 2024
2024
-
[59]
Self-adjusting binary search trees
Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting binary search trees. J. ACM , 32(3):652--686, 1985
1985
-
[60]
Trinh, Yuhuai Wu, Quoc V
Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature , 625(7995):476--482, 2024
2024
-
[61]
130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026
Josef Urban. 130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026
2026
-
[62]
Hilbert: Recursively building formal proofs with informal reasoning, 2025
Sumanth Varambally et al. Hilbert: Recursively building formal proofs with informal reasoning, 2025
2025
-
[63]
Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, et al
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, et al. Accelerating scientific research with G emini: Case studies and common techniques, 2026
2026
-
[64]
Lower bounds for accessing binary search trees with rotations
Robert Wilber. Lower bounds for accessing binary search trees with rotations. SIAM J. Comput. , 18(1):56--67, 1989
1989
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.