pith. sign in

arxiv: 2603.19715 · v2 · submitted 2026-03-20 · 💻 cs.AI

Neuro-Symbolic Proof Generation for Scaling Systems Software Verification

Pith reviewed 2026-05-15 08:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords neuro-symbolic AIautomated theorem provingsystems software verificationIsabelleseL4LLM fine-tuningproof search
0
0 comments X

The pith

A neuro-symbolic framework proves up to 77.6 percent of theorems in the seL4 verification project by combining LLM step prediction with symbolic repair.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that automates proof search for large-scale systems verification by running a best-first tree search over proof states. At each state an LLM, fine-tuned on pairs of proof states and valid steps, proposes the next tactic or lemma. Symbolic components from the interactive theorem prover then repair invalid suggestions, prune unproductive branches, rank remaining states, and discharge simple subgoals when progress stalls. On the seL4 benchmark this hybrid process succeeds on far more theorems than prior LLM-only methods or standalone Sledgehammer, including many that require multiple reasoning steps. The same system also generalizes to other Isabelle developments without retraining.

Core claim

The neuro-symbolic proof generation framework automates proof search by performing best-first tree search over proof states, repeatedly querying a fine-tuned LLM for candidate next proof steps, and employing a suite of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls, achieving up to 77.6 percent success on the FVEL seL4 benchmark while solving significantly more multi-step proofs than previous approaches.

What carries the argument

Best-first tree search over proof states that queries a fine-tuned LLM for next candidate proof steps and applies symbolic ITP tools for repair, ranking, filtering, and subgoal discharge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be ported to other interactive theorem provers if they expose comparable fine-grained proof-state interfaces and automation hooks.
  • Further scaling might allow routine automated verification of entire critical software stacks that today require years of expert manual proof effort.
  • The same repair-and-rank loop could be applied to other search-based reasoning tasks such as program synthesis or hardware design verification.

Load-bearing premise

Fine-tuning an LLM on proof-state and step pairs plus symbolic repair will reliably produce valid next steps for unseen theorems without causing excessive unrepairable failures or search explosion.

What would settle it

Testing the framework on a fresh collection of theorems from seL4 or another large Isabelle project and checking whether the success rate falls substantially below 77.6 percent would directly test the central claim.

Figures

Figures reproduced from arXiv: 2603.19715 by Baoding He, Taolue Chen, Wei Sun, Xiaoxing Ma, Yuan Yao, Zenan Li, Zhendong Su.

Figure 1
Figure 1. Figure 1: An example theorem and its human-written proof from the seL4 project, shown alongside LLM-generated attempts. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our neuro-symbolic proof-search framework. Starting from the initial proof state, the system repeatedly [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Proof success rate across ground-truth proof lengths. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Proof success rates across the six seL4 session cat [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Proof completion rate under varying proportions [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Sequence and Jaccard similarity between gener [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Performance of our proof search framework on ad [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
read the original abstract

Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for system-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6\% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further Isabelle benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a neuro-symbolic proof generation framework for Isabelle that performs best-first tree search, using a fine-tuned LLM to propose next proof steps and symbolic ITP tools for repair, filtering, ranking, and subgoal discharge. It reports up to 77.6% success on the FVEL seL4 benchmark (substantially above prior LLM baselines and standalone Sledgehammer) plus strong results on additional Isabelle developments, attributing gains to data-efficient adaptation and semantics-informed pruning.

Significance. If the central empirical claims hold under proper decontamination, the work offers a concrete, reproducible advance toward scaling automated verification of large systems like seL4 by showing that LLM next-step prediction plus symbolic repair can solve substantially more multi-step proofs than either component alone. The new fine-grained Isabelle REPL is a reusable engineering contribution that lowers the barrier for future neuro-symbolic experiments.

major comments (3)
  1. [Training data section (likely §4)] Training data section (likely §4): the manuscript must explicitly state whether any seL4 theorems or their proof states were present in the fine-tuning corpus of proof-state/step pairs. Without a clear decontamination statement or train/test split description, the 77.6% success rate cannot be interpreted as evidence of generalization rather than memorization.
  2. [Evaluation section (§5)] Evaluation section (§5): the reported 77.6% success rate and multi-step proof counts lack an explicit definition of what constitutes a solved theorem, how partial or repaired proofs are tallied, failure-mode categorization, or any statistical significance tests. These omissions leave the headline performance comparison only moderately supported.
  3. [Framework description (§3)] Framework description (§3): while the high-level neuro-symbolic loop is clear, the paper should provide more detail on the exact repair heuristics, ranking function, and termination conditions used when the LLM step is rejected; without these, it is difficult to assess why the approach solves significantly more multi-step proofs than the baselines.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'additional Isabelle developments' should name the specific benchmarks (e.g., the exact theories or repositories) to allow readers to assess the scope of the generalization claim.
  2. [Notation consistency] Notation consistency: ensure 'proof state' and 'proof step' are used uniformly; occasional shifts to 'goal' or 'tactic' without definition can confuse readers unfamiliar with Isabelle internals.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications, which will strengthen the presentation of our methods and results.

read point-by-point responses
  1. Referee: Training data section (likely §4): the manuscript must explicitly state whether any seL4 theorems or their proof states were present in the fine-tuning corpus of proof-state/step pairs. Without a clear decontamination statement or train/test split description, the 77.6% success rate cannot be interpreted as evidence of generalization rather than memorization.

    Authors: We agree that an explicit decontamination statement is required for proper interpretation of the results. The fine-tuning corpus was constructed exclusively from Isabelle proof libraries and developments that exclude the seL4 benchmark. In the revised manuscript we will add a dedicated paragraph in Section 4 that (i) lists the exact sources used for the proof-state/step pairs, (ii) confirms that no seL4 theorems or their intermediate proof states appear in the training data, and (iii) describes the train/test split procedure to ensure no leakage. revision: yes

  2. Referee: Evaluation section (§5): the reported 77.6% success rate and multi-step proof counts lack an explicit definition of what constitutes a solved theorem, how partial or repaired proofs are tallied, failure-mode categorization, or any statistical significance tests. These omissions leave the headline performance comparison only moderately supported.

    Authors: We acknowledge the need for greater precision in the evaluation protocol. In the revised Section 5 we will add: (1) an explicit definition of a solved theorem (a complete, Isabelle-verified proof script with no remaining subgoals), (2) clarification that only fully verified proofs are counted as successes (repaired or partial proofs are not tallied as solved), (3) a breakdown of failure modes (e.g., search timeout, repair failure, LLM step rejection), and (4) statistical significance tests (McNemar’s test) comparing our method against the LLM-only and Sledgehammer baselines. These additions will make the performance claims more rigorously supported. revision: yes

  3. Referee: Framework description (§3): while the high-level neuro-symbolic loop is clear, the paper should provide more detail on the exact repair heuristics, ranking function, and termination conditions used when the LLM step is rejected; without these, it is difficult to assess why the approach solves significantly more multi-step proofs than the baselines.

    Authors: We agree that additional implementation details will help readers understand the performance gains. In the revised Section 3 we will expand the description of the symbolic components to include: (i) the precise repair heuristics (specific Isabelle tactics and auto methods invoked on rejected steps), (ii) the ranking function (a weighted combination of LLM log-probability, subgoal count, and proof-state complexity), and (iii) the termination conditions (maximum search depth of 20, per-node timeout of 30 seconds, and global budget of 500 LLM queries). These details will clarify how the neuro-symbolic integration enables solving more multi-step proofs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical success rates are measured outcomes on public benchmarks

full rationale

The paper describes a neuro-symbolic framework that fine-tunes LLMs on proof-state/step pairs and evaluates the resulting system on the FVEL seL4 benchmark and other Isabelle developments. Reported figures such as the 77.6% success rate are direct empirical measurements of theorems proved, not quantities derived from equations or parameters that are defined in terms of the same measured outcomes. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims rest on observable performance rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLMs can be usefully fine-tuned for proof-step prediction and that symbolic tools can repair most invalid suggestions; no new entities are postulated and no free parameters are explicitly fitted in the abstract.

axioms (2)
  • domain assumption LLMs trained on proof state-step pairs can suggest useful next steps for unseen theorems
    Invoked in the description of the neural component and fine-tuning process.
  • domain assumption Symbolic ITP tools can reliably repair, filter, and discharge subgoals generated by the LLM
    Stated as part of the synergy that enables the search to progress.

pith-pipeline@v0.9.0 · 5544 in / 1341 out tokens · 36672 ms · 2026-05-15T08:58:33.560720+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 5 internal anchors

  1. [1]

    Clarke and Jeannette M

    Edmund M. Clarke and Jeannette M. Wing. Formal methods: state of the art and future directions.ACM Computing Surveys, 28(4):626–643, 1996

  2. [2]

    A static analyzer for large safety-critical software

    Bruno Blanchet, Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné, David Mon- niaux, and Xavier Rival. A static analyzer for large safety-critical software. InProceedings of the ACM SIGPLAN Conference on Programming Language De- sign and Implementation (PLDI), pages 196–207, 2003

  3. [3]

    Formal methods: Practice and ex- perience.ACM Computing Surveys, 41(4):1–36, 2009

    James Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. Formal methods: Practice and ex- perience.ACM Computing Surveys, 41(4):1–36, 2009

  4. [4]

    Compcert-a formally verified optimizing compiler

    Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bern- hard Schommer, Markus Pister, and Christian Ferdinand. Compcert-a formally verified optimizing compiler. In ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress, 2016

  5. [5]

    Comprehensive formal verification of an os microkernel.ACM Transactions on Computer Systems (TOCS), 32(1):1–70, 2014

    Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Ger- not Heiser. Comprehensive formal verification of an os microkernel.ACM Transactions on Computer Systems (TOCS), 32(1):1–70, 2014

  6. [6]

    Qed at large: A survey of engineering of formally verified software.Foundations and Trends® in Programming Languages, 5(2-3):102– 281, 2019

    Talia Ringer, Karl Palmskog, Ilya Sergey, Milos Glig- oric, Zachary Tatlock, et al. Qed at large: A survey of engineering of formally verified software.Foundations and Trends® in Programming Languages, 5(2-3):102– 281, 2019

  7. [7]

    Can large language models verify system software? a case study using fscq as a benchmark

    Jianxing Qin, Alexander Du, Danfeng Zhang, Matthew Lentz, and Danyang Zhuo. Can large language models verify system software? a case study using fscq as a benchmark. InProceedings of the 2025 Workshop on Hot Topics in Operating Systems, pages 34–41, 2025

  8. [8]

    Rango: Adaptive retrieval-augmented proving for automated software ver- ification

    Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F Fer- reira, Sorin Lerner, and Emily First. Rango: Adaptive retrieval-augmented proving for automated software ver- ification. In2025 IEEE/ACM 47th International Confer- ence on Software Engineering (ICSE), pages 347–359, 2025

  9. [9]

    Selene: Pio- neering automated proof in software verification

    Lichen Zhang, Shuai Lu, and Nan Duan. Selene: Pio- neering automated proof in software verification. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1776–1789, 2024

  10. [10]

    Fvel: Interactive formal verification environment with large language models via theorem proving.Advances in Neural Information Processing Systems, 37:54932–54946, 2024

    Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, and Xiaodan Liang. Fvel: Interactive formal verification environment with large language models via theorem proving.Advances in Neural Information Processing Systems, 37:54932–54946, 2024

  11. [11]

    Using crash hoare logic for certifying the fscq file system

    Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chli- pala, M Frans Kaashoek, and Nickolai Zeldovich. Using crash hoare logic for certifying the fscq file system. In Proceedings of the 25th Symposium on Operating Sys- tems Principles, pages 18–37, 2015

  12. [12]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.arXiv preprint arXi...

  13. [13]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

  14. [14]

    Nit- pick: A counterexample generator for higher-order logic based on a relational model finder

    Jasmin Christian Blanchette and Tobias Nipkow. Nit- pick: A counterexample generator for higher-order logic based on a relational model finder. InInternational con- ference on interactive theorem proving, pages 131–146. Springer, 2010

  15. [15]

    Quickcheck: a lightweight tool for random testing of haskell programs

    Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. InProceedings of the fifth ACM SIGPLAN international conference on Functional programming, pages 268–279, 2000

  16. [16]

    Sledgehammer: judgement day

    Sascha Böhme and Tobias Nipkow. Sledgehammer: judgement day. InInternational Joint Conference on Automated Reasoning, pages 107–121. Springer, 2010

  17. [17]

    Internlm2.5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems

    Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, and Kai Chen. Internlm2. 5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems.arXiv preprint arXiv:2410.15700, 2024

  18. [18]

    Bfs-prover: Scalable best-first tree search for llm- based automatic theorem proving.arXiv preprint arXiv:2502.03438, 2025

    Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Kai Shen. Bfs-prover: Scalable best-first tree search for llm- based automatic theorem proving.arXiv preprint arXiv:2502.03438, 2025

  19. [19]

    Gemini: A family of highly capable multimodal models

    Google DeepMind. Gemini: A family of highly capable multimodal models. https://deepmind.google/ technologies/gemini/, 2024. Accessed model vari- ant: Gemini 3 Pro. 14

  20. [20]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2024

  21. [21]

    Leandojo: The- orem proving with retrieval-augmented language mod- els.Advances in Neural Information Processing Sys- tems, 36:21573–21612, 2023

    Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chala- mala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: The- orem proving with retrieval-augmented language mod- els.Advances in Neural Information Processing Sys- tems, 36:21573–21612, 2023

  22. [22]

    Llm library learning fails: A lego-prover case study.arXiv preprint arXiv:2504.03048, 2025

    Ian Berlot-Attwell, Frank Rudzicz, and Xujie Si. Llm library learning fails: A lego-prover case study.arXiv preprint arXiv:2504.03048, 2025

  23. [23]

    A compute-matched re-evaluation of trove on math.arXiv preprint arXiv:2507.22069, 2025

    Tobias Sesterhenn, Ian Berlot-Attwell, Janis Zenkner, and Christian Bartelt. A compute-matched re-evaluation of trove on math.arXiv preprint arXiv:2507.22069, 2025

  24. [24]

    Tacticzero: Learning to prove theo- rems from scratch with deep reinforcement learning

    Minchao Wu, Michael Norrish, Christian Walder, and Amir Dezfouli. Tacticzero: Learning to prove theo- rems from scratch with deep reinforcement learning. Advances in Neural Information Processing Systems, 34:9330–9342, 2021

  25. [25]

    Leanprogress: Guiding search for neural theorem proving via proof progress prediction.arXiv preprint arXiv:2502.17925,

    Suozhi Huang, Peiyang Song, Robert Joseph George, and Anima Anandkumar. Leanprogress: Guiding search for neural theorem proving via proof progress prediction. arXiv preprint arXiv:2502.17925, 2025

  26. [26]

    Neural theorem proving on inequality problems.Artifi- cial Intelligence and Theorem Proving (AITP), 2020

    Yuhuai Wu, Albert Jiang, Roger Grosse, and Jimmy Ba. Neural theorem proving on inequality problems.Artifi- cial Intelligence and Theorem Proving (AITP), 2020

  27. [27]

    Tactictoe: learning to prove with tactics.Journal of Automated Reasoning, 65(2):257–286, 2021

    Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ra- mana Kumar, and Michael Norrish. Tactictoe: learning to prove with tactics.Journal of Automated Reasoning, 65(2):257–286, 2021

  28. [28]

    Machine-learned premise selection for lean

    Bartosz Piotrowski, Ramon Fernández Mir, and Edward Ayers. Machine-learned premise selection for lean. In International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, pages 175–186. Springer, 2023

  29. [29]

    Solve_direct: A tool for isabelle/hol to check whether a newly stated theorem can be solved directly by an existing theorem

    Timothy Bourke and Gerwin Klein. Solve_direct: A tool for isabelle/hol to check whether a newly stated theorem can be solved directly by an existing theorem. https://isabelle.in.tum.de/library/HOL/ HOL/ISABELLE_HOME/src/Tools/solve_direct. ML.html, 2025. File: src/Tools/solve_direct.ML in Isabelle/HOL library

  30. [30]

    scala-isabelle: A scala library for interacting with isabelle

    Dominique Unruh. scala-isabelle: A scala library for interacting with isabelle. https://github.com/ dominique-unruh/scala-isabelle, 2025. Ac- cessed 2025-11-18

  31. [31]

    N. M. Nishant and G. S. Mamatha. Using py4j for java-python communication.International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), 2023

  32. [32]

    Lisa: Language models of isabelle proofs

    Albert Qiaochu Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. Lisa: Language models of isabelle proofs. In6th Conference on Artificial Intelligence and Theorem Proving, pages 378–392, 2021

  33. [33]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  34. [34]

    Llamafactory: Unified efficient fine-tuning of 100+ lan- guage models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ lan- guage models. InProceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thai- land, 2024. Association for Computational Linguistics

  35. [35]

    Lightweight rele- vance filtering for machine-generated resolution prob- lems.Journal of Applied Logic, 7(1):41–57, 2009

    Jia Meng and Lawrence C Paulson. Lightweight rele- vance filtering for machine-generated resolution prob- lems.Journal of Applied Logic, 7(1):41–57, 2009

  36. [36]

    Mash: machine learning for sledgehammer

    Daniel Kühlwein, Jasmin Christian Blanchette, Cezary Kaliszyk, and Josef Urban. Mash: machine learning for sledgehammer. InInternational Conference on Interactive Theorem Proving, pages 35–50. Springer, 2013

  37. [37]

    seL4 enforces integrity

    Thomas Sewell, Simon Winwood, Peter Gammie, Toby Murray, June Andronick, and Gerwin Klein. seL4 enforces integrity. InProceedings of the Second Inter- national Conference on Interactive Theorem Proving, ITP ’11, pages 325–340, Berlin, Heidelberg, August

  38. [38]

    seL4: From general pur- pose to a proof of information flow enforcement

    Toby Murray, Daniel Matichuk, Matthew Brassil, Peter Gammie, Timothy Bourke, Sean Seefried, Corey Lewis, Xin Gao, and Gerwin Klein. seL4: From general pur- pose to a proof of information flow enforcement. In IEEE Symposium on Security and Privacy, pages 415– 429, San Francisco, CA, May 2013. IEEE

  39. [39]

    Empirical study towards a leading indicator for cost of formal soft- ware verification

    Daniel Matichuk, Toby Murray, June Andronick, Ross Jeffery, Gerwin Klein, and Mark Staples. Empirical study towards a leading indicator for cost of formal soft- ware verification. InProceedings of the 37th Interna- tional Conference on Software Engineering - Volume 1, ICSE ’15, page 722–732. IEEE Press, 2015

  40. [40]

    Binary coors capable or ‘correcting deletions, insertions, and reversals

    VI Lcvenshtcin. Binary coors capable or ‘correcting deletions, insertions, and reversals. InSoviet physics- doklady, volume 10, 1966. 15

  41. [41]

    Étude comparative de la distribution florale dans une portion des alpes et des jura.Bull Soc Vaudoise Sci Nat, 37:547–579, 1901

    Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura.Bull Soc Vaudoise Sci Nat, 37:547–579, 1901

  42. [42]

    X86 instruction semantics and basic block sym- bolic execution.Archive of Formal Proofs, Octo- ber 2021

    Freek Verbeek, Abhijith Bharadwaj, Joshua Bock- enek, Ian Roessle, Timmy Weerwag, and Binoy Ravin- dran. X86 instruction semantics and basic block sym- bolic execution.Archive of Formal Proofs, Octo- ber 2021. https://isa-afp.org/entries/X86_ Semantics.html, Formal proof development

  43. [43]

    A formal model of ieee floating point arithmetic.Archive of Formal Proofs, July

    Lei Yu. A formal model of ieee floating point arithmetic.Archive of Formal Proofs, July

  44. [44]

    https://isa-afp.org/entries/IEEE_ Floating_Point.html, Formal proof development

  45. [45]

    Formal verification of mod- ern sat solvers.Archive of Formal Proofs, July 2008

    Filip Mari ´c. Formal verification of mod- ern sat solvers.Archive of Formal Proofs, July 2008. https://isa-afp.org/entries/ SATSolverVerification.html, Formal proof devel- opment

  46. [46]

    The proba- bilistic relevance framework: Bm25 and beyond.Foun- dations and Trends® in Information Retrieval, 3(4):333– 389, 2009

    Stephen Robertson, Hugo Zaragoza, et al. The proba- bilistic relevance framework: Bm25 and beyond.Foun- dations and Trends® in Information Retrieval, 3(4):333– 389, 2009

  47. [47]

    {CertiKOS}: An extensible architecture for building certified concurrent {OS} kernels

    Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan New- man Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. {CertiKOS}: An extensible architecture for building certified concurrent {OS} kernels. In12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 653–669, 2016

  48. [48]

    Formally verified memory protection for a commodity multiprocessor hypervisor

    Shih-Wei Li, Xupeng Li, Ronghui Gu, Jason Nieh, and John Zhuang Hui. Formally verified memory protection for a commodity multiprocessor hypervisor. In30th USENIX Security Symposium (USENIX Security 21), pages 3953–3970. USENIX Association, August 2021

  49. [49]

    The design and implementation of vampire.AI communications, 15(2-3):91–110, 2002

    Alexandre Riazanov and Andrei V oronkov. The design and implementation of vampire.AI communications, 15(2-3):91–110, 2002

  50. [50]

    Faster, higher, stronger: E 2.3

    Stephan Schulz, Simon Cruanes, and Petar Vukmirovi´c. Faster, higher, stronger: E 2.3. InInternational Confer- ence on Automated Deduction, pages 495–507. Springer, 2019

  51. [51]

    Z3: An efficient smt solver

    Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient smt solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008

  52. [52]

    cvc5: A versatile and industrial-strength smt solver

    Haniel Barbosa, Clark Barrett, Martin Brain, Gereon Kremer, Hanna Lachnitt, Makai Mann, Abdalrhman Mohamed, Mudathir Mohamed, Aina Niemetz, Andres Nötzli, et al. cvc5: A versatile and industrial-strength smt solver. InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 415–442. Springer, 2022

  53. [53]

    Hammer for coq: Automation for dependent type theory.Journal of auto- mated reasoning, 61(1):423–453, 2018

    Łukasz Czajka and Cezary Kaliszyk. Hammer for coq: Automation for dependent type theory.Journal of auto- mated reasoning, 61(1):423–453, 2018

  54. [54]

    Ironclad apps: {End-to-End} security via auto- mated {Full-System} verification

    Chris Hawblitzel, Jon Howell, Jacob R Lorch, Ar- jun Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill. Ironclad apps: {End-to-End} security via auto- mated {Full-System} verification. In11th USENIX symposium on operating systems design and implemen- tation (OSDI 14), pages 165–181, 2014

  55. [55]

    Safe to the last instruc- tion: automated verification of a type-safe operating system.Communications of the ACM, 54(12):123–131, 2011

    Jean Yang and Chris Hawblitzel. Safe to the last instruc- tion: automated verification of a type-safe operating system.Communications of the ACM, 54(12):123–131, 2011

  56. [56]

    Dafny: An automatic program verifier for functional correctness

    K Rustan M Leino. Dafny: An automatic program verifier for functional correctness. InInternational con- ference on logic for programming artificial intelligence and reasoning, pages 348–370. Springer, 2010

  57. [57]

    Boogie: A mod- ular reusable verifier for object-oriented programs

    Mike Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K Rustan M Leino. Boogie: A mod- ular reusable verifier for object-oriented programs. In International Symposium on Formal Methods for Com- ponents and Objects, pages 364–387. Springer, 2005

  58. [58]

    Holophrasm: a neural Automated Theorem Prover for higher-order logic

    Daniel Whalen. Holophrasm: a neural automated the- orem prover for higher-order logic.arXiv preprint arXiv:1608.02644, 2016

  59. [59]

    Generating correctness proofs with neural networks

    Alex Sanchez-Stern, Yousef Alhessi, Lawrence Saul, and Sorin Lerner. Generating correctness proofs with neural networks. InProceedings of the 4th ACM SIG- PLAN International Workshop on Machine Learning and Programming Languages, pages 1–10, 2020

  60. [60]

    Diversity-driven automated formal verification

    Emily First and Yuriy Brun. Diversity-driven automated formal verification. InProceedings of the 44th Inter- national Conference on Software Engineering, pages 749–761, 2022

  61. [61]

    Draft, sketch, and prove: Guiding formal theorem provers with informal proofs

    Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Timo- thee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

  62. [62]

    Reviving dsp for advanced theorem proving in the era of reasoning models.arXiv preprint arXiv:2506.11487, 2025

    Chenrui Cao, Liangcheng Song, Zenan Li, Xinyi Le, Xian Zhang, Hui Xue, and Fan Yang. Reviving dsp for advanced theorem proving in the era of reasoning models.arXiv preprint arXiv:2506.11487, 2025. 16

  63. [63]

    Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

    Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

  64. [64]

    Baldur: Whole-proof generation and repair with large language models

    Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1229–1241, 2023

  65. [65]

    Proof automation with large language models

    Minghai Lu, Benjamin Delaware, and Tianyi Zhang. Proof automation with large language models. InPro- ceedings of the 39th IEEE/ACM International Confer- ence on Automated Software Engineering, pages 1509– 1520, 2024

  66. [66]

    Guided proof search using large language models and lemma extraction in coq

    Tarun Prasad and Nada Amin. Guided proof search using large language models and lemma extraction in coq. InICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2024

  67. [67]

    Generative Language Modeling for Automated Theorem Proving

    Stanislas Polu and Ilya Sutskever. Generative lan- guage modeling for automated theorem proving.arXiv preprint arXiv:2009.03393, 2020

  68. [68]

    Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337– 26349, 2022

    Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337– 26349, 2022

  69. [69]

    Thor: Wielding ham- mers to integrate language models and automated theo- rem provers.Advances in Neural Information Process- ing Systems, 35:8360–8373, 2022

    Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygó´ zd´ z, Piotr Miło´s, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding ham- mers to integrate language models and automated theo- rem provers.Advances in Neural Information Process- ing Systems, 35:8360–8373, 2022

  70. [70]

    Proofaug: Efficient neural theorem proving via fine-grained proof structure analysis.arXiv preprint arXiv:2501.18310, 2025

    Haoxiong Liu, Jiacheng Sun, Zhenguo Li, and An- drew C Yao. Proofaug: Efficient neural theorem proving via fine-grained proof structure analysis.arXiv preprint arXiv:2501.18310, 2025

  71. [71]

    Proof strategy extraction from llms for enhancing symbolic provers

    Jian Fang, Yican Sun, and Yingfei Xiong. Proof strategy extraction from llms for enhancing symbolic provers. arXiv preprint arXiv:2510.10131, 2025. 17