arxiv: 2604.16989 · v2 · submitted 2026-04-18 · 💻 cs.CL · cs.AI· cs.LG· cs.LO

Recognition: unknown

Bolzano: Case Studies in LLM-Assisted Mathematical Research

Martin Balko , Jan Greb\'ik , Pavel Hub\'a\v{c}ek , Martin Kouteck\'y , Mat\v{e}j Kripner , V\'aclav Rozho\v{n} , Robert \v{S}\'amal , Adri\'an Z\'ame\v{c}n\'ik

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LGcs.LO

keywords LLM-assisted researchmulti-agent systemsmathematical problem solvingautonomous theorem provingtheoretical computer scienceverifier agentspersistent knowledge basepublishable results

0 comments

The pith

An LLM multi-agent system generates five essentially autonomous publishable results on math and theoretical computer science problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports new results on eight problems produced using an open-source multi-agent LLM system that runs parallel prover agents in interaction with a verifier agent while carrying a persistent knowledge base across rounds. Six of the eight outcomes meet the threshold for publishable research, and five were achieved with essentially no human intervention beyond the initial setup. The work supplies concrete cases showing LLMs can contribute to actual mathematical research rather than only routine tasks or hints.

Core claim

The system produces new results on eight problems by orchestrating rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base carried across rounds. When classified by significance and autonomy, six results reach publishable research level and five were produced essentially autonomously by the system itself.

What carries the argument

A multi-agent LLM system that coordinates parallel prover agents with a verifier agent and maintains a persistent knowledge base across interaction rounds.

If this is right

LLMs can reach publishable standards on selected problems in mathematics and theoretical computer science.
Multi-agent prover-verifier loops with shared memory support extended autonomous solving sessions.
Five of eight cases required only minimal human guidance after initial problem statements.
The same architecture applies across both pure mathematics and theoretical computer science questions.
Such systems can complement existing human-led research workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar agent loops with persistent state could be tested on open problems that require chaining many steps.
If the autonomy holds under scrutiny, the approach points toward LLMs handling larger portions of research pipelines.
The pattern of prover and verifier agents may transfer to other verification-heavy domains such as formal methods or experimental design.

Load-bearing premise

The generated results are correct and novel and the claimed level of autonomy reflects the actual process without undisclosed human intervention.

What would settle it

Independent verification that finds an error in one of the eight claimed results or documentation showing more human input than stated would falsify the autonomy and publishability claims.

read the original abstract

We report new results on eight problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open-source multi-agent LLM system. Bolzano orchestrates rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base that is carried across rounds. Classified using the significance-autonomy taxonomy of Feng et al., six of the eight results reach the level of publishable research, and five of the eight were produced essentially autonomously by Bolzano. Our results provide evidence that LLMs can contribute meaningfully to mathematical research, complementing recent reports by Bubeck et al., Woodruff et al., and others.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Bolzano, an open-source multi-agent LLM system that orchestrates prover and verifier agents with a persistent knowledge base, and reports eight case studies in mathematics and theoretical computer science. Using the Feng et al. significance-autonomy taxonomy, it claims that six of the eight results reach publishable-research level and five were produced essentially autonomously by Bolzano, providing evidence that LLMs can contribute meaningfully to mathematical research.

Significance. If the correctness and autonomy claims hold, the work supplies concrete, open-source case studies that complement reports by Bubeck et al. and Woodruff et al., demonstrating a practical multi-agent workflow with persistent state. The open-source release of the system itself is a strength that enables reproducibility and further experimentation.

major comments (3)

[Case studies (results classification)] The central claim that five results are 'essentially autonomous' (per Feng et al. taxonomy) is load-bearing yet unsupported by full agent-interaction logs, problem-choice records, or blinded external review; internal classification by the authors risks circularity in assessing autonomy.
[Results and evaluation] No independent verification (formal proof checking, external expert review, or journal submission) of the six 'publishable' results is provided; given known risks of subtle errors in LLM-generated mathematics, this undermines the publishability claim.
[Bolzano system description and case studies] The paper does not report the exact prompting strategies, iteration counts, or human interventions in problem selection and refinement for the 'autonomous' cases, making it impossible to assess whether the autonomy threshold was met without undisclosed assistance.

minor comments (2)

[Abstract] The abstract would benefit from one-sentence descriptions of the mathematical domains or specific problems addressed in the eight case studies.
[Introduction] Ensure all references to the Feng et al. taxonomy include the full citation and a brief summary of the relevant categories used for classification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important areas where additional evidence and transparency can strengthen the manuscript. We address each major comment below and commit to a major revision that incorporates more detailed logs, verification steps, and system usage information while preserving the core contribution of the open-source Bolzano system and its case studies.

read point-by-point responses

Referee: [Case studies (results classification)] The central claim that five results are 'essentially autonomous' (per Feng et al. taxonomy) is load-bearing yet unsupported by full agent-interaction logs, problem-choice records, or blinded external review; internal classification by the authors risks circularity in assessing autonomy.

Authors: We agree that the current manuscript provides insufficient raw material for independent assessment of the autonomy classification. In the revision we will add (or link to) the complete agent-interaction traces and problem-choice records for the five cases classified as essentially autonomous. We will also include an explicit appendix that maps each case to the precise Feng et al. criteria, documenting the exact points at which human input occurred (limited to initial problem statement and system launch for those five cases). While a fully blinded external review cannot be performed retroactively for this submission, the added traces will allow any reader to apply the taxonomy themselves and thereby reduce the risk of circularity. revision: yes
Referee: [Results and evaluation] No independent verification (formal proof checking, external expert review, or journal submission) of the six 'publishable' results is provided; given known risks of subtle errors in LLM-generated mathematics, this undermines the publishability claim.

Authors: We accept that the absence of external verification weakens the strength of the 'publishable-research level' claim. The revised manuscript will (i) state more clearly that the six results are offered as case studies of the system rather than as independently certified theorems, (ii) add formal verification output (where the statements admit it) using existing theorem provers, and (iii) report the current submission status of each result to journals or conferences. We will also expand the discussion of the verifier agent's role and the internal consistency checks performed. These additions will not eliminate all risk of subtle errors but will make the evidential basis more transparent. revision: partial
Referee: [Bolzano system description and case studies] The paper does not report the exact prompting strategies, iteration counts, or human interventions in problem selection and refinement for the 'autonomous' cases, making it impossible to assess whether the autonomy threshold was met without undisclosed assistance.

Authors: We will revise the system-description and case-study sections to include the precise prompting templates (both prover and verifier), the average and range of iteration counts per case, and a tabulated record of every human intervention. For the five autonomous cases the table will show that interventions were confined to launching the system and, in two instances, restarting after an infrastructure failure; no mathematical content was supplied or edited by humans after the initial problem statement. This level of detail will allow readers to judge whether the autonomy threshold was met. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical case studies with external taxonomy

full rationale

The paper reports outcomes from an LLM system on eight mathematical problems without presenting any mathematical derivation, model, or first-principles chain. Results are classified using the external significance-autonomy taxonomy of Feng et al. (distinct authors) and compared to independent reports by Bubeck et al. and Woodruff et al. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear. Claims rest on empirical observation and external classification rather than internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This preprint reports empirical case studies on an LLM system rather than developing a new mathematical theory or derivation. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5457 in / 1010 out tokens · 74522 ms · 2026-05-10T06:58:26.801199+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Blumberg, Martin Hairer, Joe Kileel, Tamara G

Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First Proof , 2026

2026
[2]

N. Alon, O. Goldreich, J. H stad, and R. Peralta. Simple constructions of almost k -wise independent random variables. In 31st A nnual S ymposium on F oundations of C omputer S cience, V ol.\ I , II ( S t.\ L ouis, MO , 1990) , pages 544--553. IEEE Comput. Soc. Press, Los Alamitos, CA, 1990

1990
[3]

Ax-Prover : A deep reasoning agentic framework for theorem proving in mathematics and quantum physics, 2025

Benjamin Breen et al. Ax-Prover : A deep reasoning agentic framework for theorem proving in mathematics and quantum physics, 2025

2025
[4]

S. Barman. Approximating N ash equilibria and dense subgraphs via an approximate version of C arath\'eodory's theorem. SIAM J. Comput. , 47(3):960--981, 2018

2018
[5]

Babichenko, S

Y. Babichenko, S. Barman, and R. Peretz. Simple approximate equilibria in large games. In Proceedings of the Fifteenth ACM Conference on Economics and Computation , EC '14, page 753–770, New York, NY, USA, 2014. Association for Computing Machinery

2014
[6]

Spears, Derya Unutmaz, Kevin Weil, Steven Yin, and Nikita Zhivotovskiy

S\' e bastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, and Nikita Zhivotovskiy. Early science acceleration experiments with GPT -5, 2025

2025
[7]

Efficient polynomial commitment schemes for multiple points and polynomials

Dan Boneh, Justin Drake, Ben Fisch, and Ariel Gabizon. Efficient polynomial commitment schemes for multiple points and polynomials. IACR Cryptol. ePrint Arch. , page 81, 2020

2020
[8]

Periodicity and decidability of tilings of Z ^2

Siddhartha Bhattacharya. Periodicity and decidability of tilings of Z ^2 . American Journal of Mathematics , 142(1):255--266, 2020

2020
[9]

Bessy, F

S. Bessy, F. Havet, and E. Birmel\' e . Arc-chromatic number of digraphs in which every vertex has bounded outdegree or bounded indegree. J. Graph Theory , 53(4):315--332, December 2006

2006
[10]

CHOPIN : Optimal pairing-based multilinear polynomial commitments from bivariate KZG

Juraj Belohorec, Pavel Hub \'a c ek, Aleksi Kalsta, and Krist \'y na Ma s kov \'a . CHOPIN : Optimal pairing-based multilinear polynomial commitments from bivariate KZG . IACR Cryptol. ePrint Arch. , page 480, 2026. https://eprint.iacr.org/archive/2026/480/20260308:102759

2026
[11]

Buss and Alan S

Samuel R. Buss and Alan S. Johnson. Propositional proofs and reductions between NP search problems. Annals of Pure and Applied Logic , 163(9):1163--1182, 2012

2012
[12]

https://bolzano.app/heap-equivalence, 2025

Bolzano transcript: Heap equivalence. https://bolzano.app/heap-equivalence, 2025

2025
[13]

Bolzano Team . Bolzano. https://bolzano.app/, 2025

2025
[14]

https://bolzano.app/cce-supports, 2026

Bolzano transcript: CCE supports. https://bolzano.app/cce-supports, 2026

2026
[15]

https://bolzano.app/kkos-optimization, 2026

Bolzano transcript: KKOS optimization. https://bolzano.app/kkos-optimization, 2026

2026
[16]

https://bolzano.app/kzg-batching, 2026

Bolzano transcript: KZG batching. https://bolzano.app/kzg-batching, 2026

2026
[17]

https://bolzano.app/function-preimage-partitioning, 2026

Bolzano transcript: Function-preimage partitioning. https://bolzano.app/function-preimage-partitioning, 2026

2026
[18]

https://bolzano.app/pwpp-separation, 2026

Bolzano transcript: PWPP separation. https://bolzano.app/pwpp-separation, 2026

2026
[19]

https://bolzano.app/multi-slope-tilings, 2026

Bolzano transcript: Multi-slope tilings. https://bolzano.app/multi-slope-tilings, 2026

2026
[20]

https://bolzano.app/problem/ade3d456-8788-4d11-b55f-926644ddbc42/files?file_id=d056ac12-6ef7-496a-a790-eba9105816bb, 2026

Bolzano transcript: Interleaving and W ilber's first bound. https://bolzano.app/problem/ade3d456-8788-4d11-b55f-926644ddbc42/files?file_id=d056ac12-6ef7-496a-a790-eba9105816bb, 2026

2026
[21]

Balko and T

M. Balko and T. C \' i z ek. Approximating nash equilibria in sparse games. In preparation, 2026

2026
[22]

Settling the complexity of computing two-player N ash equilibria

Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player N ash equilibria. J. ACM , 56(3):Art. 14, 57, 2009

2009
[23]

The Open Proof Corpus : A large-scale study of LLM -generated mathematical proofs, 2025

Jasper Dekoninck et al. The Open Proof Corpus : A large-scale study of LLM -generated mathematical proofs, 2025

2025
[24]

Translational tilings by axes-parallel polygonal sets

Jaume de Dios Pont , Jan Greb\' i k, Rachel Greenfeld, and Jos\' e Madrid. Translational tilings by axes-parallel polygonal sets. Work in progress, 2026

2026
[25]

Goldberg, and Christos H

Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a N ash equilibrium. SIAM J. Comput. , 39(1):195--259, 2009

2009
[26]

Demaine, Dion Harmon, John Iacono, and Mihai P a tra s cu

Erik D. Demaine, Dion Harmon, John Iacono, and Mihai P a tra s cu. Dynamic optimality---almost. SIAM J. Comput. , 37(1):240--251, 2007

2007
[27]

Tenenbaum, and Igor Mordatch

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023

2023
[28]

A priority queue with the working-set property

Amr Elmasry. A priority queue with the working-set property. Int. J. Found. Comput. Sci. , 17(6):1455--1466, 2006

2006
[29]

Semi-autonomous mathematics discovery with G emini: A case study on the Erd o s problems, 2026

Tony Feng et al. Semi-autonomous mathematics discovery with G emini: A case study on the Erd o s problems, 2026

2026
[30]

Towards autonomous mathematics research, 2026

Tony Feng et al. Towards autonomous mathematics research, 2026

2026
[31]

Opinion diffusion and campaigning on society graphs

Piotr Faliszewski, Rica Gonen, Martin Kouteck \' y , and Nimrod Talmon. Opinion diffusion and campaigning on society graphs. J. Log. Comput. , 32(6):1162--1194, 2022

2022
[32]

Black-box PPP is not T uring-closed

Noah Fleming, Stefan Grosser, Toniann Pitassi, and Robert Robere. Black-box PPP is not T uring-closed. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC) , pages 1405--1414, 2024

2024
[33]

Aletheia tackles FirstProof autonomously, 2026

Tony Feng, Junehyuk Jung, Sang hyun Kim, Carlo Pagano, et al. Aletheia tackles FirstProof autonomously, 2026

2026
[34]

Right is not enough: The pitfalls of outcome supervision in training LLMs for math reasoning, 2025

Jiaxing Guo et al. Right is not enough: The pitfalls of outcome supervision in training LLMs for math reasoning, 2025

2025
[35]

Mathematical exploration and discovery at scale, 2025

Bogdan Georgiev, Javier G \'o mez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale, 2025

2025
[36]

The S tructure of T ranslational T ilings in Z ^d

Rachel Greenfeld and Terence Tao. The S tructure of T ranslational T ilings in Z ^d . Discrete Analysis , 2021

2021
[37]

A counterexample to the periodic tiling conjecture

Rachel Greenfeld and Terence Tao. A counterexample to the periodic tiling conjecture. Annals of Mathematics , 200(1):301--363, 2024

2024
[38]

Undecidability of translational monotilings

Rachel Greenfeld and Terence Tao. Undecidability of translational monotilings. J. Eur. Math. Soc. , 2025

2025
[39]

Universal optimality of D ijkstra via beyond-worst-case heaps, 2023

Bernhard Haeupler, Richard Hlad\' i k, V\' a clav Rozho n , Robert Tarjan, and Jakub T e tek. Universal optimality of D ijkstra via beyond-worst-case heaps, 2023

2023
[40]

Horv \'a th, et al

Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z. Horv \'a th, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature , 2025

2025
[41]

Black-box PWPP is not T uring closed

Pavel Hub\' a c ek. Black-box PWPP is not T uring closed. CoRR , abs/2602.23809, 2026. https://doi.org/10.48550/arXiv.2602.23809

work page doi:10.48550/arxiv.2602.23809 2026
[42]

Hao Huang and Lin F. Yang. Winning gold at IMO 2025 with a model-agnostic verification-and-refinement pipeline, 2025

2025
[43]

Improved upper bounds for pairing heaps

John Iacono. Improved upper bounds for pairing heaps. In Scandinavian Workshop on Algorithm Theory , pages 32--45. Springer, 2000

2000
[44]

Integer factoring and modular square roots

Emil Je r \'a bek. Integer factoring and modular square roots. Journal of Computer and System Sciences , 82(2):380--394, 2016

2016
[45]

Rigidity of planar tilings

Richard Kenyon. Rigidity of planar tilings. Inventiones Mathematicae , 107(3):637--651, 1992

1992
[46]

Kleinberg, Sigal Oren, and Aleksandrs Slivkins

David Kempe, Jon M. Kleinberg, Sigal Oren, and Aleksandrs Slivkins. Selection and influence in cultural dynamics. Netw. Sci. , 4(1):1--27, 2016

2016
[47]

Open problems

Karel Kr \'a l. Open problems. https://kam.mff.cuni.cz/ kamak/static/problems/2020.pdf, 2020. KAMAK Problem Solving Workshop

2020
[48]

Zaverucha, and Ian Goldberg

Aniket Kate, Gregory M. Zaverucha, and Ian Goldberg. Constant-size commitments to polynomials and their applications. In Masayuki Abe, editor, Advances in Cryptology - ASIACRYPT 2010 - 16th International Conference on the Theory and Application of Cryptology and Information Security, Singapore, December 5-9, 2010. Proceedings , volume 6477 of Lecture Note...

2010
[49]

R. J. Lipton, E. Markakis, and A. Mehta. Playing large games using simple strategies. In Proceedings of the 4th ACM Conference on Electronic Commerce , EC '03, pages 36--41, New York, NY, USA, 2003. ACM

2003
[50]

Aleph Prover , 2025

Logical Intelligence . Aleph Prover , 2025. https://logicalintelligence.com/aleph-prover.html

2025
[51]

On knowledge-soundness of Plonk in ROM from falsifiable assumptions

Helger Lipmaa, Roberto Parisella, and Janno Siim. On knowledge-soundness of Plonk in ROM from falsifiable assumptions. In Yael Tauman Kalai and Seny F. Kamara, editors, Advances in Cryptology - CRYPTO 2025 - 45th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 17-21, 2025, Proceedings, Part VII , volume 16006 of Lecture Notes in...

2025
[52]

Lagarias and Yang Wang

Jeffrey C. Lagarias and Yang Wang. Tiling the line with translates of one tile. Inventiones Mathematicae , 124(1--3):341--365, 1996

1996
[53]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 , 2023

work page internal anchor Pith review arXiv 2023
[54]

AlphaEvolve : A coding agent for scientific and algorithmic discovery, 2025

Alexander Novikov et al. AlphaEvolve : A coding agent for scientific and algorithmic discovery, 2025

2025
[55]

Our First Proof submissions

OpenAI . Our First Proof submissions. https://openai.com/index/first-proof-submissions/, 2026

2026
[56]

Bowman, and Shi Feng

Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations, 2024

2024
[57]

On infinite sets with no 3 on a line, 2026

Moe Putterman, Mehtaab Sawhney, and Gregory Valiant. On infinite sets with no 3 on a line, 2026

2026
[58]

Pawan Kumar, Emilien Dupont, Francisco J

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature , 625(7995):468--475, 2024

2024
[59]

Self-adjusting binary search trees

Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting binary search trees. J. ACM , 32(3):652--686, 1985

1985
[60]

Trinh, Yuhuai Wu, Quoc V

Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature , 625(7995):476--482, 2024

2024
[61]

130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026

Josef Urban. 130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026

2026
[62]

Hilbert: Recursively building formal proofs with informal reasoning, 2025

Sumanth Varambally et al. Hilbert: Recursively building formal proofs with informal reasoning, 2025

2025
[63]

Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, et al

David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, et al. Accelerating scientific research with G emini: Case studies and common techniques, 2026

2026
[64]

Lower bounds for accessing binary search trees with rotations

Robert Wilber. Lower bounds for accessing binary search trees with rotations. SIAM J. Comput. , 18(1):56--67, 1989

1989