Neuro-Symbolic Proof Generation for Scaling Systems Software Verification
Pith reviewed 2026-05-15 08:58 UTC · model grok-4.3
The pith
A neuro-symbolic framework proves up to 77.6 percent of theorems in the seL4 verification project by combining LLM step prediction with symbolic repair.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The neuro-symbolic proof generation framework automates proof search by performing best-first tree search over proof states, repeatedly querying a fine-tuned LLM for candidate next proof steps, and employing a suite of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls, achieving up to 77.6 percent success on the FVEL seL4 benchmark while solving significantly more multi-step proofs than previous approaches.
What carries the argument
Best-first tree search over proof states that queries a fine-tuned LLM for next candidate proof steps and applies symbolic ITP tools for repair, ranking, filtering, and subgoal discharge.
Where Pith is reading between the lines
- The approach could be ported to other interactive theorem provers if they expose comparable fine-grained proof-state interfaces and automation hooks.
- Further scaling might allow routine automated verification of entire critical software stacks that today require years of expert manual proof effort.
- The same repair-and-rank loop could be applied to other search-based reasoning tasks such as program synthesis or hardware design verification.
Load-bearing premise
Fine-tuning an LLM on proof-state and step pairs plus symbolic repair will reliably produce valid next steps for unseen theorems without causing excessive unrepairable failures or search explosion.
What would settle it
Testing the framework on a fresh collection of theorems from seL4 or another large Isabelle project and checking whether the success rate falls substantially below 77.6 percent would directly test the central claim.
Figures
read the original abstract
Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for system-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6\% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further Isabelle benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a neuro-symbolic proof generation framework for Isabelle that performs best-first tree search, using a fine-tuned LLM to propose next proof steps and symbolic ITP tools for repair, filtering, ranking, and subgoal discharge. It reports up to 77.6% success on the FVEL seL4 benchmark (substantially above prior LLM baselines and standalone Sledgehammer) plus strong results on additional Isabelle developments, attributing gains to data-efficient adaptation and semantics-informed pruning.
Significance. If the central empirical claims hold under proper decontamination, the work offers a concrete, reproducible advance toward scaling automated verification of large systems like seL4 by showing that LLM next-step prediction plus symbolic repair can solve substantially more multi-step proofs than either component alone. The new fine-grained Isabelle REPL is a reusable engineering contribution that lowers the barrier for future neuro-symbolic experiments.
major comments (3)
- [Training data section (likely §4)] Training data section (likely §4): the manuscript must explicitly state whether any seL4 theorems or their proof states were present in the fine-tuning corpus of proof-state/step pairs. Without a clear decontamination statement or train/test split description, the 77.6% success rate cannot be interpreted as evidence of generalization rather than memorization.
- [Evaluation section (§5)] Evaluation section (§5): the reported 77.6% success rate and multi-step proof counts lack an explicit definition of what constitutes a solved theorem, how partial or repaired proofs are tallied, failure-mode categorization, or any statistical significance tests. These omissions leave the headline performance comparison only moderately supported.
- [Framework description (§3)] Framework description (§3): while the high-level neuro-symbolic loop is clear, the paper should provide more detail on the exact repair heuristics, ranking function, and termination conditions used when the LLM step is rejected; without these, it is difficult to assess why the approach solves significantly more multi-step proofs than the baselines.
minor comments (2)
- [Abstract] Abstract: the phrase 'additional Isabelle developments' should name the specific benchmarks (e.g., the exact theories or repositories) to allow readers to assess the scope of the generalization claim.
- [Notation consistency] Notation consistency: ensure 'proof state' and 'proof step' are used uniformly; occasional shifts to 'goal' or 'tactic' without definition can confuse readers unfamiliar with Isabelle internals.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications, which will strengthen the presentation of our methods and results.
read point-by-point responses
-
Referee: Training data section (likely §4): the manuscript must explicitly state whether any seL4 theorems or their proof states were present in the fine-tuning corpus of proof-state/step pairs. Without a clear decontamination statement or train/test split description, the 77.6% success rate cannot be interpreted as evidence of generalization rather than memorization.
Authors: We agree that an explicit decontamination statement is required for proper interpretation of the results. The fine-tuning corpus was constructed exclusively from Isabelle proof libraries and developments that exclude the seL4 benchmark. In the revised manuscript we will add a dedicated paragraph in Section 4 that (i) lists the exact sources used for the proof-state/step pairs, (ii) confirms that no seL4 theorems or their intermediate proof states appear in the training data, and (iii) describes the train/test split procedure to ensure no leakage. revision: yes
-
Referee: Evaluation section (§5): the reported 77.6% success rate and multi-step proof counts lack an explicit definition of what constitutes a solved theorem, how partial or repaired proofs are tallied, failure-mode categorization, or any statistical significance tests. These omissions leave the headline performance comparison only moderately supported.
Authors: We acknowledge the need for greater precision in the evaluation protocol. In the revised Section 5 we will add: (1) an explicit definition of a solved theorem (a complete, Isabelle-verified proof script with no remaining subgoals), (2) clarification that only fully verified proofs are counted as successes (repaired or partial proofs are not tallied as solved), (3) a breakdown of failure modes (e.g., search timeout, repair failure, LLM step rejection), and (4) statistical significance tests (McNemar’s test) comparing our method against the LLM-only and Sledgehammer baselines. These additions will make the performance claims more rigorously supported. revision: yes
-
Referee: Framework description (§3): while the high-level neuro-symbolic loop is clear, the paper should provide more detail on the exact repair heuristics, ranking function, and termination conditions used when the LLM step is rejected; without these, it is difficult to assess why the approach solves significantly more multi-step proofs than the baselines.
Authors: We agree that additional implementation details will help readers understand the performance gains. In the revised Section 3 we will expand the description of the symbolic components to include: (i) the precise repair heuristics (specific Isabelle tactics and auto methods invoked on rejected steps), (ii) the ranking function (a weighted combination of LLM log-probability, subgoal count, and proof-state complexity), and (iii) the termination conditions (maximum search depth of 20, per-node timeout of 30 seconds, and global budget of 500 LLM queries). These details will clarify how the neuro-symbolic integration enables solving more multi-step proofs. revision: yes
Circularity Check
No circularity: empirical success rates are measured outcomes on public benchmarks
full rationale
The paper describes a neuro-symbolic framework that fine-tunes LLMs on proof-state/step pairs and evaluates the resulting system on the FVEL seL4 benchmark and other Isabelle developments. Reported figures such as the 77.6% success rate are direct empirical measurements of theorems proved, not quantities derived from equations or parameters that are defined in terms of the same measured outcomes. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims rest on observable performance rather than internal redefinition.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLMs trained on proof state-step pairs can suggest useful next steps for unseen theorems
- domain assumption Symbolic ITP tools can reliably repair, filter, and discharge subgoals generated by the LLM
Reference graph
Works this paper leans on
-
[1]
Edmund M. Clarke and Jeannette M. Wing. Formal methods: state of the art and future directions.ACM Computing Surveys, 28(4):626–643, 1996
work page 1996
-
[2]
A static analyzer for large safety-critical software
Bruno Blanchet, Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné, David Mon- niaux, and Xavier Rival. A static analyzer for large safety-critical software. InProceedings of the ACM SIGPLAN Conference on Programming Language De- sign and Implementation (PLDI), pages 196–207, 2003
work page 2003
-
[3]
Formal methods: Practice and ex- perience.ACM Computing Surveys, 41(4):1–36, 2009
James Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. Formal methods: Practice and ex- perience.ACM Computing Surveys, 41(4):1–36, 2009
work page 2009
-
[4]
Compcert-a formally verified optimizing compiler
Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bern- hard Schommer, Markus Pister, and Christian Ferdinand. Compcert-a formally verified optimizing compiler. In ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress, 2016
work page 2016
-
[5]
Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Ger- not Heiser. Comprehensive formal verification of an os microkernel.ACM Transactions on Computer Systems (TOCS), 32(1):1–70, 2014
work page 2014
-
[6]
Talia Ringer, Karl Palmskog, Ilya Sergey, Milos Glig- oric, Zachary Tatlock, et al. Qed at large: A survey of engineering of formally verified software.Foundations and Trends® in Programming Languages, 5(2-3):102– 281, 2019
work page 2019
-
[7]
Can large language models verify system software? a case study using fscq as a benchmark
Jianxing Qin, Alexander Du, Danfeng Zhang, Matthew Lentz, and Danyang Zhuo. Can large language models verify system software? a case study using fscq as a benchmark. InProceedings of the 2025 Workshop on Hot Topics in Operating Systems, pages 34–41, 2025
work page 2025
-
[8]
Rango: Adaptive retrieval-augmented proving for automated software ver- ification
Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F Fer- reira, Sorin Lerner, and Emily First. Rango: Adaptive retrieval-augmented proving for automated software ver- ification. In2025 IEEE/ACM 47th International Confer- ence on Software Engineering (ICSE), pages 347–359, 2025
work page 2025
-
[9]
Selene: Pio- neering automated proof in software verification
Lichen Zhang, Shuai Lu, and Nan Duan. Selene: Pio- neering automated proof in software verification. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1776–1789, 2024
work page 2024
-
[10]
Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, and Xiaodan Liang. Fvel: Interactive formal verification environment with large language models via theorem proving.Advances in Neural Information Processing Systems, 37:54932–54946, 2024
work page 2024
-
[11]
Using crash hoare logic for certifying the fscq file system
Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chli- pala, M Frans Kaashoek, and Nickolai Zeldovich. Using crash hoare logic for certifying the fscq file system. In Proceedings of the 25th Symposium on Operating Sys- tems Principles, pages 18–37, 2015
work page 2015
-
[12]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.arXiv preprint arXi...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024
work page 2024
-
[14]
Nit- pick: A counterexample generator for higher-order logic based on a relational model finder
Jasmin Christian Blanchette and Tobias Nipkow. Nit- pick: A counterexample generator for higher-order logic based on a relational model finder. InInternational con- ference on interactive theorem proving, pages 131–146. Springer, 2010
work page 2010
-
[15]
Quickcheck: a lightweight tool for random testing of haskell programs
Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. InProceedings of the fifth ACM SIGPLAN international conference on Functional programming, pages 268–279, 2000
work page 2000
-
[16]
Sascha Böhme and Tobias Nipkow. Sledgehammer: judgement day. InInternational Joint Conference on Automated Reasoning, pages 107–121. Springer, 2010
work page 2010
-
[17]
Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, and Kai Chen. Internlm2. 5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems.arXiv preprint arXiv:2410.15700, 2024
-
[18]
Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Kai Shen. Bfs-prover: Scalable best-first tree search for llm- based automatic theorem proving.arXiv preprint arXiv:2502.03438, 2025
-
[19]
Gemini: A family of highly capable multimodal models
Google DeepMind. Gemini: A family of highly capable multimodal models. https://deepmind.google/ technologies/gemini/, 2024. Accessed model vari- ant: Gemini 3 Pro. 14
work page 2024
-
[20]
OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chala- mala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: The- orem proving with retrieval-augmented language mod- els.Advances in Neural Information Processing Sys- tems, 36:21573–21612, 2023
work page 2023
-
[22]
Llm library learning fails: A lego-prover case study.arXiv preprint arXiv:2504.03048, 2025
Ian Berlot-Attwell, Frank Rudzicz, and Xujie Si. Llm library learning fails: A lego-prover case study.arXiv preprint arXiv:2504.03048, 2025
-
[23]
A compute-matched re-evaluation of trove on math.arXiv preprint arXiv:2507.22069, 2025
Tobias Sesterhenn, Ian Berlot-Attwell, Janis Zenkner, and Christian Bartelt. A compute-matched re-evaluation of trove on math.arXiv preprint arXiv:2507.22069, 2025
-
[24]
Tacticzero: Learning to prove theo- rems from scratch with deep reinforcement learning
Minchao Wu, Michael Norrish, Christian Walder, and Amir Dezfouli. Tacticzero: Learning to prove theo- rems from scratch with deep reinforcement learning. Advances in Neural Information Processing Systems, 34:9330–9342, 2021
work page 2021
-
[25]
Suozhi Huang, Peiyang Song, Robert Joseph George, and Anima Anandkumar. Leanprogress: Guiding search for neural theorem proving via proof progress prediction. arXiv preprint arXiv:2502.17925, 2025
-
[26]
Yuhuai Wu, Albert Jiang, Roger Grosse, and Jimmy Ba. Neural theorem proving on inequality problems.Artifi- cial Intelligence and Theorem Proving (AITP), 2020
work page 2020
-
[27]
Tactictoe: learning to prove with tactics.Journal of Automated Reasoning, 65(2):257–286, 2021
Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ra- mana Kumar, and Michael Norrish. Tactictoe: learning to prove with tactics.Journal of Automated Reasoning, 65(2):257–286, 2021
work page 2021
-
[28]
Machine-learned premise selection for lean
Bartosz Piotrowski, Ramon Fernández Mir, and Edward Ayers. Machine-learned premise selection for lean. In International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, pages 175–186. Springer, 2023
work page 2023
-
[29]
Timothy Bourke and Gerwin Klein. Solve_direct: A tool for isabelle/hol to check whether a newly stated theorem can be solved directly by an existing theorem. https://isabelle.in.tum.de/library/HOL/ HOL/ISABELLE_HOME/src/Tools/solve_direct. ML.html, 2025. File: src/Tools/solve_direct.ML in Isabelle/HOL library
work page 2025
-
[30]
scala-isabelle: A scala library for interacting with isabelle
Dominique Unruh. scala-isabelle: A scala library for interacting with isabelle. https://github.com/ dominique-unruh/scala-isabelle, 2025. Ac- cessed 2025-11-18
work page 2025
-
[31]
N. M. Nishant and G. S. Mamatha. Using py4j for java-python communication.International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), 2023
work page 2023
-
[32]
Lisa: Language models of isabelle proofs
Albert Qiaochu Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. Lisa: Language models of isabelle proofs. In6th Conference on Artificial Intelligence and Theorem Proving, pages 378–392, 2021
work page 2021
-
[33]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Llamafactory: Unified efficient fine-tuning of 100+ lan- guage models
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ lan- guage models. InProceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thai- land, 2024. Association for Computational Linguistics
work page 2024
-
[35]
Jia Meng and Lawrence C Paulson. Lightweight rele- vance filtering for machine-generated resolution prob- lems.Journal of Applied Logic, 7(1):41–57, 2009
work page 2009
-
[36]
Mash: machine learning for sledgehammer
Daniel Kühlwein, Jasmin Christian Blanchette, Cezary Kaliszyk, and Josef Urban. Mash: machine learning for sledgehammer. InInternational Conference on Interactive Theorem Proving, pages 35–50. Springer, 2013
work page 2013
-
[37]
Thomas Sewell, Simon Winwood, Peter Gammie, Toby Murray, June Andronick, and Gerwin Klein. seL4 enforces integrity. InProceedings of the Second Inter- national Conference on Interactive Theorem Proving, ITP ’11, pages 325–340, Berlin, Heidelberg, August
-
[38]
seL4: From general pur- pose to a proof of information flow enforcement
Toby Murray, Daniel Matichuk, Matthew Brassil, Peter Gammie, Timothy Bourke, Sean Seefried, Corey Lewis, Xin Gao, and Gerwin Klein. seL4: From general pur- pose to a proof of information flow enforcement. In IEEE Symposium on Security and Privacy, pages 415– 429, San Francisco, CA, May 2013. IEEE
work page 2013
-
[39]
Empirical study towards a leading indicator for cost of formal soft- ware verification
Daniel Matichuk, Toby Murray, June Andronick, Ross Jeffery, Gerwin Klein, and Mark Staples. Empirical study towards a leading indicator for cost of formal soft- ware verification. InProceedings of the 37th Interna- tional Conference on Software Engineering - Volume 1, ICSE ’15, page 722–732. IEEE Press, 2015
work page 2015
-
[40]
Binary coors capable or ‘correcting deletions, insertions, and reversals
VI Lcvenshtcin. Binary coors capable or ‘correcting deletions, insertions, and reversals. InSoviet physics- doklady, volume 10, 1966. 15
work page 1966
-
[41]
Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura.Bull Soc Vaudoise Sci Nat, 37:547–579, 1901
work page 1901
-
[42]
Freek Verbeek, Abhijith Bharadwaj, Joshua Bock- enek, Ian Roessle, Timmy Weerwag, and Binoy Ravin- dran. X86 instruction semantics and basic block sym- bolic execution.Archive of Formal Proofs, Octo- ber 2021. https://isa-afp.org/entries/X86_ Semantics.html, Formal proof development
work page 2021
-
[43]
A formal model of ieee floating point arithmetic.Archive of Formal Proofs, July
Lei Yu. A formal model of ieee floating point arithmetic.Archive of Formal Proofs, July
-
[44]
https://isa-afp.org/entries/IEEE_ Floating_Point.html, Formal proof development
-
[45]
Formal verification of mod- ern sat solvers.Archive of Formal Proofs, July 2008
Filip Mari ´c. Formal verification of mod- ern sat solvers.Archive of Formal Proofs, July 2008. https://isa-afp.org/entries/ SATSolverVerification.html, Formal proof devel- opment
work page 2008
-
[46]
Stephen Robertson, Hugo Zaragoza, et al. The proba- bilistic relevance framework: Bm25 and beyond.Foun- dations and Trends® in Information Retrieval, 3(4):333– 389, 2009
work page 2009
-
[47]
{CertiKOS}: An extensible architecture for building certified concurrent {OS} kernels
Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan New- man Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. {CertiKOS}: An extensible architecture for building certified concurrent {OS} kernels. In12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 653–669, 2016
work page 2016
-
[48]
Formally verified memory protection for a commodity multiprocessor hypervisor
Shih-Wei Li, Xupeng Li, Ronghui Gu, Jason Nieh, and John Zhuang Hui. Formally verified memory protection for a commodity multiprocessor hypervisor. In30th USENIX Security Symposium (USENIX Security 21), pages 3953–3970. USENIX Association, August 2021
work page 2021
-
[49]
The design and implementation of vampire.AI communications, 15(2-3):91–110, 2002
Alexandre Riazanov and Andrei V oronkov. The design and implementation of vampire.AI communications, 15(2-3):91–110, 2002
work page 2002
-
[50]
Faster, higher, stronger: E 2.3
Stephan Schulz, Simon Cruanes, and Petar Vukmirovi´c. Faster, higher, stronger: E 2.3. InInternational Confer- ence on Automated Deduction, pages 495–507. Springer, 2019
work page 2019
-
[51]
Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient smt solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008
work page 2008
-
[52]
cvc5: A versatile and industrial-strength smt solver
Haniel Barbosa, Clark Barrett, Martin Brain, Gereon Kremer, Hanna Lachnitt, Makai Mann, Abdalrhman Mohamed, Mudathir Mohamed, Aina Niemetz, Andres Nötzli, et al. cvc5: A versatile and industrial-strength smt solver. InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 415–442. Springer, 2022
work page 2022
-
[53]
Łukasz Czajka and Cezary Kaliszyk. Hammer for coq: Automation for dependent type theory.Journal of auto- mated reasoning, 61(1):423–453, 2018
work page 2018
-
[54]
Ironclad apps: {End-to-End} security via auto- mated {Full-System} verification
Chris Hawblitzel, Jon Howell, Jacob R Lorch, Ar- jun Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill. Ironclad apps: {End-to-End} security via auto- mated {Full-System} verification. In11th USENIX symposium on operating systems design and implemen- tation (OSDI 14), pages 165–181, 2014
work page 2014
-
[55]
Jean Yang and Chris Hawblitzel. Safe to the last instruc- tion: automated verification of a type-safe operating system.Communications of the ACM, 54(12):123–131, 2011
work page 2011
-
[56]
Dafny: An automatic program verifier for functional correctness
K Rustan M Leino. Dafny: An automatic program verifier for functional correctness. InInternational con- ference on logic for programming artificial intelligence and reasoning, pages 348–370. Springer, 2010
work page 2010
-
[57]
Boogie: A mod- ular reusable verifier for object-oriented programs
Mike Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K Rustan M Leino. Boogie: A mod- ular reusable verifier for object-oriented programs. In International Symposium on Formal Methods for Com- ponents and Objects, pages 364–387. Springer, 2005
work page 2005
-
[58]
Holophrasm: a neural Automated Theorem Prover for higher-order logic
Daniel Whalen. Holophrasm: a neural automated the- orem prover for higher-order logic.arXiv preprint arXiv:1608.02644, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[59]
Generating correctness proofs with neural networks
Alex Sanchez-Stern, Yousef Alhessi, Lawrence Saul, and Sorin Lerner. Generating correctness proofs with neural networks. InProceedings of the 4th ACM SIG- PLAN International Workshop on Machine Learning and Programming Languages, pages 1–10, 2020
work page 2020
-
[60]
Diversity-driven automated formal verification
Emily First and Yuriy Brun. Diversity-driven automated formal verification. InProceedings of the 44th Inter- national Conference on Software Engineering, pages 749–761, 2022
work page 2022
-
[61]
Draft, sketch, and prove: Guiding formal theorem provers with informal proofs
Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Timo- thee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations (ICLR), 2023
work page 2023
-
[62]
Chenrui Cao, Liangcheng Song, Zenan Li, Xinyi Le, Xian Zhang, Hui Xue, and Fan Yang. Reviving dsp for advanced theorem proving in the era of reasoning models.arXiv preprint arXiv:2506.11487, 2025. 16
-
[63]
Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024
Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024
work page 2024
-
[64]
Baldur: Whole-proof generation and repair with large language models
Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1229–1241, 2023
work page 2023
-
[65]
Proof automation with large language models
Minghai Lu, Benjamin Delaware, and Tianyi Zhang. Proof automation with large language models. InPro- ceedings of the 39th IEEE/ACM International Confer- ence on Automated Software Engineering, pages 1509– 1520, 2024
work page 2024
-
[66]
Guided proof search using large language models and lemma extraction in coq
Tarun Prasad and Nada Amin. Guided proof search using large language models and lemma extraction in coq. InICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2024
work page 2025
-
[67]
Generative Language Modeling for Automated Theorem Proving
Stanislas Polu and Ilya Sutskever. Generative lan- guage modeling for automated theorem proving.arXiv preprint arXiv:2009.03393, 2020
work page internal anchor Pith review arXiv 2009
-
[68]
Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337– 26349, 2022
work page 2022
-
[69]
Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygó´ zd´ z, Piotr Miło´s, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding ham- mers to integrate language models and automated theo- rem provers.Advances in Neural Information Process- ing Systems, 35:8360–8373, 2022
work page 2022
-
[70]
Haoxiong Liu, Jiacheng Sun, Zhenguo Li, and An- drew C Yao. Proofaug: Efficient neural theorem proving via fine-grained proof structure analysis.arXiv preprint arXiv:2501.18310, 2025
-
[71]
Proof strategy extraction from llms for enhancing symbolic provers
Jian Fang, Yican Sun, and Yingfei Xiong. Proof strategy extraction from llms for enhancing symbolic provers. arXiv preprint arXiv:2510.10131, 2025. 17
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.