Neuro-Symbolic Proof Generation for Scaling Systems Software Verification

Baoding He; Taolue Chen; Wei Sun; Xiaoxing Ma; Yuan Yao; Zenan Li; Zhendong Su

arxiv: 2603.19715 · v2 · submitted 2026-03-20 · 💻 cs.AI

Neuro-Symbolic Proof Generation for Scaling Systems Software Verification

Baoding He , Zenan Li , Wei Sun , Yuan Yao , Taolue Chen , Xiaoxing Ma , Zhendong Su This is my paper

Pith reviewed 2026-05-15 08:58 UTC · model grok-4.3

classification 💻 cs.AI

keywords neuro-symbolic AIautomated theorem provingsystems software verificationIsabelleseL4LLM fine-tuningproof search

0 comments

The pith

A neuro-symbolic framework proves up to 77.6 percent of theorems in the seL4 verification project by combining LLM step prediction with symbolic repair.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that automates proof search for large-scale systems verification by running a best-first tree search over proof states. At each state an LLM, fine-tuned on pairs of proof states and valid steps, proposes the next tactic or lemma. Symbolic components from the interactive theorem prover then repair invalid suggestions, prune unproductive branches, rank remaining states, and discharge simple subgoals when progress stalls. On the seL4 benchmark this hybrid process succeeds on far more theorems than prior LLM-only methods or standalone Sledgehammer, including many that require multiple reasoning steps. The same system also generalizes to other Isabelle developments without retraining.

Core claim

The neuro-symbolic proof generation framework automates proof search by performing best-first tree search over proof states, repeatedly querying a fine-tuned LLM for candidate next proof steps, and employing a suite of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls, achieving up to 77.6 percent success on the FVEL seL4 benchmark while solving significantly more multi-step proofs than previous approaches.

What carries the argument

Best-first tree search over proof states that queries a fine-tuned LLM for next candidate proof steps and applies symbolic ITP tools for repair, ranking, filtering, and subgoal discharge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be ported to other interactive theorem provers if they expose comparable fine-grained proof-state interfaces and automation hooks.
Further scaling might allow routine automated verification of entire critical software stacks that today require years of expert manual proof effort.
The same repair-and-rank loop could be applied to other search-based reasoning tasks such as program synthesis or hardware design verification.

Load-bearing premise

Fine-tuning an LLM on proof-state and step pairs plus symbolic repair will reliably produce valid next steps for unseen theorems without causing excessive unrepairable failures or search explosion.

What would settle it

Testing the framework on a fresh collection of theorems from seL4 or another large Isabelle project and checking whether the success rate falls substantially below 77.6 percent would directly test the central claim.

Figures

Figures reproduced from arXiv: 2603.19715 by Baoding He, Taolue Chen, Wei Sun, Xiaoxing Ma, Yuan Yao, Zenan Li, Zhendong Su.

**Figure 1.** Figure 1: An example theorem and its human-written proof from the seL4 project, shown alongside LLM-generated attempts. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of our neuro-symbolic proof-search framework. Starting from the initial proof state, the system repeatedly [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Proof success rate across ground-truth proof lengths. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Proof success rates across the six seL4 session cat [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Proof completion rate under varying proportions [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Sequence and Jaccard similarity between gener [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Performance of our proof search framework on ad [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

read the original abstract

Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for system-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6\% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further Isabelle benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The neuro-symbolic loop gets to 77.6% on seL4 by pairing fine-tuned LLM step prediction with Isabelle repair and ranking, but the abstract leaves train/test separation unclear enough that the generalization claim needs checking.

read the letter

The main thing to know is that this paper reports a concrete lift on the FVEL seL4 benchmark: up to 77.6% theorems proved using an LLM fine-tuned on proof-state/step pairs, wrapped in best-first search, with symbolic repair, filtering, and subgoal discharge via a new Isabelle REPL. It beats prior LLM baselines and standalone Sledgehammer, especially on multi-step proofs, and shows results on other Isabelle developments as well. The integration looks practical for lowering the manual effort in large systems verification projects like seL4. The REPL and the repair/ranking loop are the pieces that feel new relative to the cited work. Fine-tuning on state-step pairs rather than full proofs also keeps the adaptation data-efficient, which is a reasonable design choice. The numbers are specific enough to be worth looking at if you care about real ITP benchmarks. The soft spot is the evaluation details. The abstract gives the headline success rate but does not state whether the seL4 theorems were held out from the fine-tuning corpus or whether any decontamination was applied. If overlap exists, the reported improvement could partly reflect memorization instead of the claimed neuro-symbolic synergy. They also skip statistical significance, variance across runs, and a breakdown of where the symbolic repair succeeds or fails. Those are fixable but matter for the central claim. This paper is for people working on automated theorem proving and LLM use in formal methods. A reader who wants to see how neural next-step prediction can be grounded with existing ITP tools will get usable benchmark comparisons and a working framework description. It deserves peer review because the empirical results are grounded in public benchmarks and the approach is concrete enough for referees to evaluate the implementation and suggest targeted improvements on the data split and failure analysis.

Referee Report

3 major / 2 minor

Summary. The paper introduces a neuro-symbolic proof generation framework for Isabelle that performs best-first tree search, using a fine-tuned LLM to propose next proof steps and symbolic ITP tools for repair, filtering, ranking, and subgoal discharge. It reports up to 77.6% success on the FVEL seL4 benchmark (substantially above prior LLM baselines and standalone Sledgehammer) plus strong results on additional Isabelle developments, attributing gains to data-efficient adaptation and semantics-informed pruning.

Significance. If the central empirical claims hold under proper decontamination, the work offers a concrete, reproducible advance toward scaling automated verification of large systems like seL4 by showing that LLM next-step prediction plus symbolic repair can solve substantially more multi-step proofs than either component alone. The new fine-grained Isabelle REPL is a reusable engineering contribution that lowers the barrier for future neuro-symbolic experiments.

major comments (3)

[Training data section (likely §4)] Training data section (likely §4): the manuscript must explicitly state whether any seL4 theorems or their proof states were present in the fine-tuning corpus of proof-state/step pairs. Without a clear decontamination statement or train/test split description, the 77.6% success rate cannot be interpreted as evidence of generalization rather than memorization.
[Evaluation section (§5)] Evaluation section (§5): the reported 77.6% success rate and multi-step proof counts lack an explicit definition of what constitutes a solved theorem, how partial or repaired proofs are tallied, failure-mode categorization, or any statistical significance tests. These omissions leave the headline performance comparison only moderately supported.
[Framework description (§3)] Framework description (§3): while the high-level neuro-symbolic loop is clear, the paper should provide more detail on the exact repair heuristics, ranking function, and termination conditions used when the LLM step is rejected; without these, it is difficult to assess why the approach solves significantly more multi-step proofs than the baselines.

minor comments (2)

[Abstract] Abstract: the phrase 'additional Isabelle developments' should name the specific benchmarks (e.g., the exact theories or repositories) to allow readers to assess the scope of the generalization claim.
[Notation consistency] Notation consistency: ensure 'proof state' and 'proof step' are used uniformly; occasional shifts to 'goal' or 'tactic' without definition can confuse readers unfamiliar with Isabelle internals.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications, which will strengthen the presentation of our methods and results.

read point-by-point responses

Referee: Training data section (likely §4): the manuscript must explicitly state whether any seL4 theorems or their proof states were present in the fine-tuning corpus of proof-state/step pairs. Without a clear decontamination statement or train/test split description, the 77.6% success rate cannot be interpreted as evidence of generalization rather than memorization.

Authors: We agree that an explicit decontamination statement is required for proper interpretation of the results. The fine-tuning corpus was constructed exclusively from Isabelle proof libraries and developments that exclude the seL4 benchmark. In the revised manuscript we will add a dedicated paragraph in Section 4 that (i) lists the exact sources used for the proof-state/step pairs, (ii) confirms that no seL4 theorems or their intermediate proof states appear in the training data, and (iii) describes the train/test split procedure to ensure no leakage. revision: yes
Referee: Evaluation section (§5): the reported 77.6% success rate and multi-step proof counts lack an explicit definition of what constitutes a solved theorem, how partial or repaired proofs are tallied, failure-mode categorization, or any statistical significance tests. These omissions leave the headline performance comparison only moderately supported.

Authors: We acknowledge the need for greater precision in the evaluation protocol. In the revised Section 5 we will add: (1) an explicit definition of a solved theorem (a complete, Isabelle-verified proof script with no remaining subgoals), (2) clarification that only fully verified proofs are counted as successes (repaired or partial proofs are not tallied as solved), (3) a breakdown of failure modes (e.g., search timeout, repair failure, LLM step rejection), and (4) statistical significance tests (McNemar’s test) comparing our method against the LLM-only and Sledgehammer baselines. These additions will make the performance claims more rigorously supported. revision: yes
Referee: Framework description (§3): while the high-level neuro-symbolic loop is clear, the paper should provide more detail on the exact repair heuristics, ranking function, and termination conditions used when the LLM step is rejected; without these, it is difficult to assess why the approach solves significantly more multi-step proofs than the baselines.

Authors: We agree that additional implementation details will help readers understand the performance gains. In the revised Section 3 we will expand the description of the symbolic components to include: (i) the precise repair heuristics (specific Isabelle tactics and auto methods invoked on rejected steps), (ii) the ranking function (a weighted combination of LLM log-probability, subgoal count, and proof-state complexity), and (iii) the termination conditions (maximum search depth of 20, per-node timeout of 30 seconds, and global budget of 500 LLM queries). These details will clarify how the neuro-symbolic integration enables solving more multi-step proofs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical success rates are measured outcomes on public benchmarks

full rationale

The paper describes a neuro-symbolic framework that fine-tunes LLMs on proof-state/step pairs and evaluates the resulting system on the FVEL seL4 benchmark and other Isabelle developments. Reported figures such as the 77.6% success rate are direct empirical measurements of theorems proved, not quantities derived from equations or parameters that are defined in terms of the same measured outcomes. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims rest on observable performance rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLMs can be usefully fine-tuned for proof-step prediction and that symbolic tools can repair most invalid suggestions; no new entities are postulated and no free parameters are explicitly fitted in the abstract.

axioms (2)

domain assumption LLMs trained on proof state-step pairs can suggest useful next steps for unseen theorems
Invoked in the description of the neural component and fine-tuning process.
domain assumption Symbolic ITP tools can reliably repair, filter, and discharge subgoals generated by the LLM
Stated as part of the synergy that enables the search to progress.

pith-pipeline@v0.9.0 · 5544 in / 1341 out tokens · 36672 ms · 2026-05-15T08:58:33.560720+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 5 internal anchors

[1]

Clarke and Jeannette M

Edmund M. Clarke and Jeannette M. Wing. Formal methods: state of the art and future directions.ACM Computing Surveys, 28(4):626–643, 1996

work page 1996
[2]

A static analyzer for large safety-critical software

Bruno Blanchet, Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné, David Mon- niaux, and Xavier Rival. A static analyzer for large safety-critical software. InProceedings of the ACM SIGPLAN Conference on Programming Language De- sign and Implementation (PLDI), pages 196–207, 2003

work page 2003
[3]

Formal methods: Practice and ex- perience.ACM Computing Surveys, 41(4):1–36, 2009

James Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. Formal methods: Practice and ex- perience.ACM Computing Surveys, 41(4):1–36, 2009

work page 2009
[4]

Compcert-a formally verified optimizing compiler

Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bern- hard Schommer, Markus Pister, and Christian Ferdinand. Compcert-a formally verified optimizing compiler. In ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress, 2016

work page 2016
[5]

Comprehensive formal verification of an os microkernel.ACM Transactions on Computer Systems (TOCS), 32(1):1–70, 2014

Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Ger- not Heiser. Comprehensive formal verification of an os microkernel.ACM Transactions on Computer Systems (TOCS), 32(1):1–70, 2014

work page 2014
[6]

Qed at large: A survey of engineering of formally verified software.Foundations and Trends® in Programming Languages, 5(2-3):102– 281, 2019

Talia Ringer, Karl Palmskog, Ilya Sergey, Milos Glig- oric, Zachary Tatlock, et al. Qed at large: A survey of engineering of formally verified software.Foundations and Trends® in Programming Languages, 5(2-3):102– 281, 2019

work page 2019
[7]

Can large language models verify system software? a case study using fscq as a benchmark

Jianxing Qin, Alexander Du, Danfeng Zhang, Matthew Lentz, and Danyang Zhuo. Can large language models verify system software? a case study using fscq as a benchmark. InProceedings of the 2025 Workshop on Hot Topics in Operating Systems, pages 34–41, 2025

work page 2025
[8]

Rango: Adaptive retrieval-augmented proving for automated software ver- ification

Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F Fer- reira, Sorin Lerner, and Emily First. Rango: Adaptive retrieval-augmented proving for automated software ver- ification. In2025 IEEE/ACM 47th International Confer- ence on Software Engineering (ICSE), pages 347–359, 2025

work page 2025
[9]

Selene: Pio- neering automated proof in software verification

Lichen Zhang, Shuai Lu, and Nan Duan. Selene: Pio- neering automated proof in software verification. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1776–1789, 2024

work page 2024
[10]

Fvel: Interactive formal verification environment with large language models via theorem proving.Advances in Neural Information Processing Systems, 37:54932–54946, 2024

Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, and Xiaodan Liang. Fvel: Interactive formal verification environment with large language models via theorem proving.Advances in Neural Information Processing Systems, 37:54932–54946, 2024

work page 2024
[11]

Using crash hoare logic for certifying the fscq file system

Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chli- pala, M Frans Kaashoek, and Nickolai Zeldovich. Using crash hoare logic for certifying the fscq file system. In Proceedings of the 25th Symposium on Operating Sys- tems Principles, pages 18–37, 2015

work page 2015
[12]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.arXiv preprint arXi...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

work page 2024
[14]

Nit- pick: A counterexample generator for higher-order logic based on a relational model finder

Jasmin Christian Blanchette and Tobias Nipkow. Nit- pick: A counterexample generator for higher-order logic based on a relational model finder. InInternational con- ference on interactive theorem proving, pages 131–146. Springer, 2010

work page 2010
[15]

Quickcheck: a lightweight tool for random testing of haskell programs

Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. InProceedings of the fifth ACM SIGPLAN international conference on Functional programming, pages 268–279, 2000

work page 2000
[16]

Sledgehammer: judgement day

Sascha Böhme and Tobias Nipkow. Sledgehammer: judgement day. InInternational Joint Conference on Automated Reasoning, pages 107–121. Springer, 2010

work page 2010
[17]

Internlm2.5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems

Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, and Kai Chen. Internlm2. 5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems.arXiv preprint arXiv:2410.15700, 2024

work page arXiv 2024
[18]

Bfs-prover: Scalable best-first tree search for llm- based automatic theorem proving.arXiv preprint arXiv:2502.03438, 2025

Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Kai Shen. Bfs-prover: Scalable best-first tree search for llm- based automatic theorem proving.arXiv preprint arXiv:2502.03438, 2025

work page arXiv 2025
[19]

Gemini: A family of highly capable multimodal models

Google DeepMind. Gemini: A family of highly capable multimodal models. https://deepmind.google/ technologies/gemini/, 2024. Accessed model vari- ant: Gemini 3 Pro. 14

work page 2024
[20]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Leandojo: The- orem proving with retrieval-augmented language mod- els.Advances in Neural Information Processing Sys- tems, 36:21573–21612, 2023

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chala- mala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: The- orem proving with retrieval-augmented language mod- els.Advances in Neural Information Processing Sys- tems, 36:21573–21612, 2023

work page 2023
[22]

Llm library learning fails: A lego-prover case study.arXiv preprint arXiv:2504.03048, 2025

Ian Berlot-Attwell, Frank Rudzicz, and Xujie Si. Llm library learning fails: A lego-prover case study.arXiv preprint arXiv:2504.03048, 2025

work page arXiv 2025
[23]

A compute-matched re-evaluation of trove on math.arXiv preprint arXiv:2507.22069, 2025

Tobias Sesterhenn, Ian Berlot-Attwell, Janis Zenkner, and Christian Bartelt. A compute-matched re-evaluation of trove on math.arXiv preprint arXiv:2507.22069, 2025

work page arXiv 2025
[24]

Tacticzero: Learning to prove theo- rems from scratch with deep reinforcement learning

Minchao Wu, Michael Norrish, Christian Walder, and Amir Dezfouli. Tacticzero: Learning to prove theo- rems from scratch with deep reinforcement learning. Advances in Neural Information Processing Systems, 34:9330–9342, 2021

work page 2021
[25]

Leanprogress: Guiding search for neural theorem proving via proof progress prediction.arXiv preprint arXiv:2502.17925,

Suozhi Huang, Peiyang Song, Robert Joseph George, and Anima Anandkumar. Leanprogress: Guiding search for neural theorem proving via proof progress prediction. arXiv preprint arXiv:2502.17925, 2025

work page arXiv 2025
[26]

Neural theorem proving on inequality problems.Artifi- cial Intelligence and Theorem Proving (AITP), 2020

Yuhuai Wu, Albert Jiang, Roger Grosse, and Jimmy Ba. Neural theorem proving on inequality problems.Artifi- cial Intelligence and Theorem Proving (AITP), 2020

work page 2020
[27]

Tactictoe: learning to prove with tactics.Journal of Automated Reasoning, 65(2):257–286, 2021

Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ra- mana Kumar, and Michael Norrish. Tactictoe: learning to prove with tactics.Journal of Automated Reasoning, 65(2):257–286, 2021

work page 2021
[28]

Machine-learned premise selection for lean

Bartosz Piotrowski, Ramon Fernández Mir, and Edward Ayers. Machine-learned premise selection for lean. In International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, pages 175–186. Springer, 2023

work page 2023
[29]

Solve_direct: A tool for isabelle/hol to check whether a newly stated theorem can be solved directly by an existing theorem

Timothy Bourke and Gerwin Klein. Solve_direct: A tool for isabelle/hol to check whether a newly stated theorem can be solved directly by an existing theorem. https://isabelle.in.tum.de/library/HOL/ HOL/ISABELLE_HOME/src/Tools/solve_direct. ML.html, 2025. File: src/Tools/solve_direct.ML in Isabelle/HOL library

work page 2025
[30]

scala-isabelle: A scala library for interacting with isabelle

Dominique Unruh. scala-isabelle: A scala library for interacting with isabelle. https://github.com/ dominique-unruh/scala-isabelle, 2025. Ac- cessed 2025-11-18

work page 2025
[31]

N. M. Nishant and G. S. Mamatha. Using py4j for java-python communication.International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), 2023

work page 2023
[32]

Lisa: Language models of isabelle proofs

Albert Qiaochu Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. Lisa: Language models of isabelle proofs. In6th Conference on Artificial Intelligence and Theorem Proving, pages 378–392, 2021

work page 2021
[33]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Llamafactory: Unified efficient fine-tuning of 100+ lan- guage models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ lan- guage models. InProceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thai- land, 2024. Association for Computational Linguistics

work page 2024
[35]

Lightweight rele- vance filtering for machine-generated resolution prob- lems.Journal of Applied Logic, 7(1):41–57, 2009

Jia Meng and Lawrence C Paulson. Lightweight rele- vance filtering for machine-generated resolution prob- lems.Journal of Applied Logic, 7(1):41–57, 2009

work page 2009
[36]

Mash: machine learning for sledgehammer

Daniel Kühlwein, Jasmin Christian Blanchette, Cezary Kaliszyk, and Josef Urban. Mash: machine learning for sledgehammer. InInternational Conference on Interactive Theorem Proving, pages 35–50. Springer, 2013

work page 2013
[37]

seL4 enforces integrity

Thomas Sewell, Simon Winwood, Peter Gammie, Toby Murray, June Andronick, and Gerwin Klein. seL4 enforces integrity. InProceedings of the Second Inter- national Conference on Interactive Theorem Proving, ITP ’11, pages 325–340, Berlin, Heidelberg, August

work page
[38]

seL4: From general pur- pose to a proof of information flow enforcement

Toby Murray, Daniel Matichuk, Matthew Brassil, Peter Gammie, Timothy Bourke, Sean Seefried, Corey Lewis, Xin Gao, and Gerwin Klein. seL4: From general pur- pose to a proof of information flow enforcement. In IEEE Symposium on Security and Privacy, pages 415– 429, San Francisco, CA, May 2013. IEEE

work page 2013
[39]

Empirical study towards a leading indicator for cost of formal soft- ware verification

Daniel Matichuk, Toby Murray, June Andronick, Ross Jeffery, Gerwin Klein, and Mark Staples. Empirical study towards a leading indicator for cost of formal soft- ware verification. InProceedings of the 37th Interna- tional Conference on Software Engineering - Volume 1, ICSE ’15, page 722–732. IEEE Press, 2015

work page 2015
[40]

Binary coors capable or ‘correcting deletions, insertions, and reversals

VI Lcvenshtcin. Binary coors capable or ‘correcting deletions, insertions, and reversals. InSoviet physics- doklady, volume 10, 1966. 15

work page 1966
[41]

Étude comparative de la distribution florale dans une portion des alpes et des jura.Bull Soc Vaudoise Sci Nat, 37:547–579, 1901

Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura.Bull Soc Vaudoise Sci Nat, 37:547–579, 1901

work page 1901
[42]

X86 instruction semantics and basic block sym- bolic execution.Archive of Formal Proofs, Octo- ber 2021

Freek Verbeek, Abhijith Bharadwaj, Joshua Bock- enek, Ian Roessle, Timmy Weerwag, and Binoy Ravin- dran. X86 instruction semantics and basic block sym- bolic execution.Archive of Formal Proofs, Octo- ber 2021. https://isa-afp.org/entries/X86_ Semantics.html, Formal proof development

work page 2021
[43]

A formal model of ieee floating point arithmetic.Archive of Formal Proofs, July

Lei Yu. A formal model of ieee floating point arithmetic.Archive of Formal Proofs, July

work page
[44]

https://isa-afp.org/entries/IEEE_ Floating_Point.html, Formal proof development

work page
[45]

Formal verification of mod- ern sat solvers.Archive of Formal Proofs, July 2008

Filip Mari ´c. Formal verification of mod- ern sat solvers.Archive of Formal Proofs, July 2008. https://isa-afp.org/entries/ SATSolverVerification.html, Formal proof devel- opment

work page 2008
[46]

The proba- bilistic relevance framework: Bm25 and beyond.Foun- dations and Trends® in Information Retrieval, 3(4):333– 389, 2009

Stephen Robertson, Hugo Zaragoza, et al. The proba- bilistic relevance framework: Bm25 and beyond.Foun- dations and Trends® in Information Retrieval, 3(4):333– 389, 2009

work page 2009
[47]

{CertiKOS}: An extensible architecture for building certified concurrent {OS} kernels

Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan New- man Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. {CertiKOS}: An extensible architecture for building certified concurrent {OS} kernels. In12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 653–669, 2016

work page 2016
[48]

Formally verified memory protection for a commodity multiprocessor hypervisor

Shih-Wei Li, Xupeng Li, Ronghui Gu, Jason Nieh, and John Zhuang Hui. Formally verified memory protection for a commodity multiprocessor hypervisor. In30th USENIX Security Symposium (USENIX Security 21), pages 3953–3970. USENIX Association, August 2021

work page 2021
[49]

The design and implementation of vampire.AI communications, 15(2-3):91–110, 2002

Alexandre Riazanov and Andrei V oronkov. The design and implementation of vampire.AI communications, 15(2-3):91–110, 2002

work page 2002
[50]

Faster, higher, stronger: E 2.3

Stephan Schulz, Simon Cruanes, and Petar Vukmirovi´c. Faster, higher, stronger: E 2.3. InInternational Confer- ence on Automated Deduction, pages 495–507. Springer, 2019

work page 2019
[51]

Z3: An efficient smt solver

Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient smt solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008

work page 2008
[52]

cvc5: A versatile and industrial-strength smt solver

Haniel Barbosa, Clark Barrett, Martin Brain, Gereon Kremer, Hanna Lachnitt, Makai Mann, Abdalrhman Mohamed, Mudathir Mohamed, Aina Niemetz, Andres Nötzli, et al. cvc5: A versatile and industrial-strength smt solver. InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 415–442. Springer, 2022

work page 2022
[53]

Hammer for coq: Automation for dependent type theory.Journal of auto- mated reasoning, 61(1):423–453, 2018

Łukasz Czajka and Cezary Kaliszyk. Hammer for coq: Automation for dependent type theory.Journal of auto- mated reasoning, 61(1):423–453, 2018

work page 2018
[54]

Ironclad apps: {End-to-End} security via auto- mated {Full-System} verification

Chris Hawblitzel, Jon Howell, Jacob R Lorch, Ar- jun Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill. Ironclad apps: {End-to-End} security via auto- mated {Full-System} verification. In11th USENIX symposium on operating systems design and implemen- tation (OSDI 14), pages 165–181, 2014

work page 2014
[55]

Safe to the last instruc- tion: automated verification of a type-safe operating system.Communications of the ACM, 54(12):123–131, 2011

Jean Yang and Chris Hawblitzel. Safe to the last instruc- tion: automated verification of a type-safe operating system.Communications of the ACM, 54(12):123–131, 2011

work page 2011
[56]

Dafny: An automatic program verifier for functional correctness

K Rustan M Leino. Dafny: An automatic program verifier for functional correctness. InInternational con- ference on logic for programming artificial intelligence and reasoning, pages 348–370. Springer, 2010

work page 2010
[57]

Boogie: A mod- ular reusable verifier for object-oriented programs

Mike Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K Rustan M Leino. Boogie: A mod- ular reusable verifier for object-oriented programs. In International Symposium on Formal Methods for Com- ponents and Objects, pages 364–387. Springer, 2005

work page 2005
[58]

Holophrasm: a neural Automated Theorem Prover for higher-order logic

Daniel Whalen. Holophrasm: a neural automated the- orem prover for higher-order logic.arXiv preprint arXiv:1608.02644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[59]

Generating correctness proofs with neural networks

Alex Sanchez-Stern, Yousef Alhessi, Lawrence Saul, and Sorin Lerner. Generating correctness proofs with neural networks. InProceedings of the 4th ACM SIG- PLAN International Workshop on Machine Learning and Programming Languages, pages 1–10, 2020

work page 2020
[60]

Diversity-driven automated formal verification

Emily First and Yuriy Brun. Diversity-driven automated formal verification. InProceedings of the 44th Inter- national Conference on Software Engineering, pages 749–761, 2022

work page 2022
[61]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs

Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Timo- thee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

work page 2023
[62]

Reviving dsp for advanced theorem proving in the era of reasoning models.arXiv preprint arXiv:2506.11487, 2025

Chenrui Cao, Liangcheng Song, Zenan Li, Xinyi Le, Xian Zhang, Hui Xue, and Fan Yang. Reviving dsp for advanced theorem proving in the era of reasoning models.arXiv preprint arXiv:2506.11487, 2025. 16

work page arXiv 2025
[63]

Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

work page 2024
[64]

Baldur: Whole-proof generation and repair with large language models

Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1229–1241, 2023

work page 2023
[65]

Proof automation with large language models

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. Proof automation with large language models. InPro- ceedings of the 39th IEEE/ACM International Confer- ence on Automated Software Engineering, pages 1509– 1520, 2024

work page 2024
[66]

Guided proof search using large language models and lemma extraction in coq

Tarun Prasad and Nada Amin. Guided proof search using large language models and lemma extraction in coq. InICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2024

work page 2025
[67]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative lan- guage modeling for automated theorem proving.arXiv preprint arXiv:2009.03393, 2020

work page internal anchor Pith review arXiv 2009
[68]

Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337– 26349, 2022

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337– 26349, 2022

work page 2022
[69]

Thor: Wielding ham- mers to integrate language models and automated theo- rem provers.Advances in Neural Information Process- ing Systems, 35:8360–8373, 2022

Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygó´ zd´ z, Piotr Miło´s, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding ham- mers to integrate language models and automated theo- rem provers.Advances in Neural Information Process- ing Systems, 35:8360–8373, 2022

work page 2022
[70]

Proofaug: Efficient neural theorem proving via fine-grained proof structure analysis.arXiv preprint arXiv:2501.18310, 2025

Haoxiong Liu, Jiacheng Sun, Zhenguo Li, and An- drew C Yao. Proofaug: Efficient neural theorem proving via fine-grained proof structure analysis.arXiv preprint arXiv:2501.18310, 2025

work page arXiv 2025
[71]

Proof strategy extraction from llms for enhancing symbolic provers

Jian Fang, Yican Sun, and Yingfei Xiong. Proof strategy extraction from llms for enhancing symbolic provers. arXiv preprint arXiv:2510.10131, 2025. 17

work page arXiv 2025

[1] [1]

Clarke and Jeannette M

Edmund M. Clarke and Jeannette M. Wing. Formal methods: state of the art and future directions.ACM Computing Surveys, 28(4):626–643, 1996

work page 1996

[2] [2]

A static analyzer for large safety-critical software

Bruno Blanchet, Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné, David Mon- niaux, and Xavier Rival. A static analyzer for large safety-critical software. InProceedings of the ACM SIGPLAN Conference on Programming Language De- sign and Implementation (PLDI), pages 196–207, 2003

work page 2003

[3] [3]

Formal methods: Practice and ex- perience.ACM Computing Surveys, 41(4):1–36, 2009

James Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. Formal methods: Practice and ex- perience.ACM Computing Surveys, 41(4):1–36, 2009

work page 2009

[4] [4]

Compcert-a formally verified optimizing compiler

Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bern- hard Schommer, Markus Pister, and Christian Ferdinand. Compcert-a formally verified optimizing compiler. In ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress, 2016

work page 2016

[5] [5]

Comprehensive formal verification of an os microkernel.ACM Transactions on Computer Systems (TOCS), 32(1):1–70, 2014

Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Ger- not Heiser. Comprehensive formal verification of an os microkernel.ACM Transactions on Computer Systems (TOCS), 32(1):1–70, 2014

work page 2014

[6] [6]

Qed at large: A survey of engineering of formally verified software.Foundations and Trends® in Programming Languages, 5(2-3):102– 281, 2019

Talia Ringer, Karl Palmskog, Ilya Sergey, Milos Glig- oric, Zachary Tatlock, et al. Qed at large: A survey of engineering of formally verified software.Foundations and Trends® in Programming Languages, 5(2-3):102– 281, 2019

work page 2019

[7] [7]

Can large language models verify system software? a case study using fscq as a benchmark

Jianxing Qin, Alexander Du, Danfeng Zhang, Matthew Lentz, and Danyang Zhuo. Can large language models verify system software? a case study using fscq as a benchmark. InProceedings of the 2025 Workshop on Hot Topics in Operating Systems, pages 34–41, 2025

work page 2025

[8] [8]

Rango: Adaptive retrieval-augmented proving for automated software ver- ification

Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F Fer- reira, Sorin Lerner, and Emily First. Rango: Adaptive retrieval-augmented proving for automated software ver- ification. In2025 IEEE/ACM 47th International Confer- ence on Software Engineering (ICSE), pages 347–359, 2025

work page 2025

[9] [9]

Selene: Pio- neering automated proof in software verification

Lichen Zhang, Shuai Lu, and Nan Duan. Selene: Pio- neering automated proof in software verification. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1776–1789, 2024

work page 2024

[10] [10]

Fvel: Interactive formal verification environment with large language models via theorem proving.Advances in Neural Information Processing Systems, 37:54932–54946, 2024

Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, and Xiaodan Liang. Fvel: Interactive formal verification environment with large language models via theorem proving.Advances in Neural Information Processing Systems, 37:54932–54946, 2024

work page 2024

[11] [11]

Using crash hoare logic for certifying the fscq file system

Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chli- pala, M Frans Kaashoek, and Nickolai Zeldovich. Using crash hoare logic for certifying the fscq file system. In Proceedings of the 25th Symposium on Operating Sys- tems Principles, pages 18–37, 2015

work page 2015

[12] [12]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.arXiv preprint arXi...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

work page 2024

[14] [14]

Nit- pick: A counterexample generator for higher-order logic based on a relational model finder

Jasmin Christian Blanchette and Tobias Nipkow. Nit- pick: A counterexample generator for higher-order logic based on a relational model finder. InInternational con- ference on interactive theorem proving, pages 131–146. Springer, 2010

work page 2010

[15] [15]

Quickcheck: a lightweight tool for random testing of haskell programs

Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. InProceedings of the fifth ACM SIGPLAN international conference on Functional programming, pages 268–279, 2000

work page 2000

[16] [16]

Sledgehammer: judgement day

Sascha Böhme and Tobias Nipkow. Sledgehammer: judgement day. InInternational Joint Conference on Automated Reasoning, pages 107–121. Springer, 2010

work page 2010

[17] [17]

Internlm2.5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems

Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, and Kai Chen. Internlm2. 5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems.arXiv preprint arXiv:2410.15700, 2024

work page arXiv 2024

[18] [18]

Bfs-prover: Scalable best-first tree search for llm- based automatic theorem proving.arXiv preprint arXiv:2502.03438, 2025

Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Kai Shen. Bfs-prover: Scalable best-first tree search for llm- based automatic theorem proving.arXiv preprint arXiv:2502.03438, 2025

work page arXiv 2025

[19] [19]

Gemini: A family of highly capable multimodal models

Google DeepMind. Gemini: A family of highly capable multimodal models. https://deepmind.google/ technologies/gemini/, 2024. Accessed model vari- ant: Gemini 3 Pro. 14

work page 2024

[20] [20]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Leandojo: The- orem proving with retrieval-augmented language mod- els.Advances in Neural Information Processing Sys- tems, 36:21573–21612, 2023

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chala- mala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: The- orem proving with retrieval-augmented language mod- els.Advances in Neural Information Processing Sys- tems, 36:21573–21612, 2023

work page 2023

[22] [22]

Llm library learning fails: A lego-prover case study.arXiv preprint arXiv:2504.03048, 2025

Ian Berlot-Attwell, Frank Rudzicz, and Xujie Si. Llm library learning fails: A lego-prover case study.arXiv preprint arXiv:2504.03048, 2025

work page arXiv 2025

[23] [23]

A compute-matched re-evaluation of trove on math.arXiv preprint arXiv:2507.22069, 2025

Tobias Sesterhenn, Ian Berlot-Attwell, Janis Zenkner, and Christian Bartelt. A compute-matched re-evaluation of trove on math.arXiv preprint arXiv:2507.22069, 2025

work page arXiv 2025

[24] [24]

Tacticzero: Learning to prove theo- rems from scratch with deep reinforcement learning

Minchao Wu, Michael Norrish, Christian Walder, and Amir Dezfouli. Tacticzero: Learning to prove theo- rems from scratch with deep reinforcement learning. Advances in Neural Information Processing Systems, 34:9330–9342, 2021

work page 2021

[25] [25]

Leanprogress: Guiding search for neural theorem proving via proof progress prediction.arXiv preprint arXiv:2502.17925,

Suozhi Huang, Peiyang Song, Robert Joseph George, and Anima Anandkumar. Leanprogress: Guiding search for neural theorem proving via proof progress prediction. arXiv preprint arXiv:2502.17925, 2025

work page arXiv 2025

[26] [26]

Neural theorem proving on inequality problems.Artifi- cial Intelligence and Theorem Proving (AITP), 2020

Yuhuai Wu, Albert Jiang, Roger Grosse, and Jimmy Ba. Neural theorem proving on inequality problems.Artifi- cial Intelligence and Theorem Proving (AITP), 2020

work page 2020

[27] [27]

Tactictoe: learning to prove with tactics.Journal of Automated Reasoning, 65(2):257–286, 2021

Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ra- mana Kumar, and Michael Norrish. Tactictoe: learning to prove with tactics.Journal of Automated Reasoning, 65(2):257–286, 2021

work page 2021

[28] [28]

Machine-learned premise selection for lean

Bartosz Piotrowski, Ramon Fernández Mir, and Edward Ayers. Machine-learned premise selection for lean. In International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, pages 175–186. Springer, 2023

work page 2023

[29] [29]

Solve_direct: A tool for isabelle/hol to check whether a newly stated theorem can be solved directly by an existing theorem

Timothy Bourke and Gerwin Klein. Solve_direct: A tool for isabelle/hol to check whether a newly stated theorem can be solved directly by an existing theorem. https://isabelle.in.tum.de/library/HOL/ HOL/ISABELLE_HOME/src/Tools/solve_direct. ML.html, 2025. File: src/Tools/solve_direct.ML in Isabelle/HOL library

work page 2025

[30] [30]

scala-isabelle: A scala library for interacting with isabelle

Dominique Unruh. scala-isabelle: A scala library for interacting with isabelle. https://github.com/ dominique-unruh/scala-isabelle, 2025. Ac- cessed 2025-11-18

work page 2025

[31] [31]

N. M. Nishant and G. S. Mamatha. Using py4j for java-python communication.International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), 2023

work page 2023

[32] [32]

Lisa: Language models of isabelle proofs

Albert Qiaochu Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. Lisa: Language models of isabelle proofs. In6th Conference on Artificial Intelligence and Theorem Proving, pages 378–392, 2021

work page 2021

[33] [33]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Llamafactory: Unified efficient fine-tuning of 100+ lan- guage models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ lan- guage models. InProceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thai- land, 2024. Association for Computational Linguistics

work page 2024

[35] [35]

Lightweight rele- vance filtering for machine-generated resolution prob- lems.Journal of Applied Logic, 7(1):41–57, 2009

Jia Meng and Lawrence C Paulson. Lightweight rele- vance filtering for machine-generated resolution prob- lems.Journal of Applied Logic, 7(1):41–57, 2009

work page 2009

[36] [36]

Mash: machine learning for sledgehammer

Daniel Kühlwein, Jasmin Christian Blanchette, Cezary Kaliszyk, and Josef Urban. Mash: machine learning for sledgehammer. InInternational Conference on Interactive Theorem Proving, pages 35–50. Springer, 2013

work page 2013

[37] [37]

seL4 enforces integrity

Thomas Sewell, Simon Winwood, Peter Gammie, Toby Murray, June Andronick, and Gerwin Klein. seL4 enforces integrity. InProceedings of the Second Inter- national Conference on Interactive Theorem Proving, ITP ’11, pages 325–340, Berlin, Heidelberg, August

work page

[38] [38]

seL4: From general pur- pose to a proof of information flow enforcement

Toby Murray, Daniel Matichuk, Matthew Brassil, Peter Gammie, Timothy Bourke, Sean Seefried, Corey Lewis, Xin Gao, and Gerwin Klein. seL4: From general pur- pose to a proof of information flow enforcement. In IEEE Symposium on Security and Privacy, pages 415– 429, San Francisco, CA, May 2013. IEEE

work page 2013

[39] [39]

Empirical study towards a leading indicator for cost of formal soft- ware verification

Daniel Matichuk, Toby Murray, June Andronick, Ross Jeffery, Gerwin Klein, and Mark Staples. Empirical study towards a leading indicator for cost of formal soft- ware verification. InProceedings of the 37th Interna- tional Conference on Software Engineering - Volume 1, ICSE ’15, page 722–732. IEEE Press, 2015

work page 2015

[40] [40]

Binary coors capable or ‘correcting deletions, insertions, and reversals

VI Lcvenshtcin. Binary coors capable or ‘correcting deletions, insertions, and reversals. InSoviet physics- doklady, volume 10, 1966. 15

work page 1966

[41] [41]

Étude comparative de la distribution florale dans une portion des alpes et des jura.Bull Soc Vaudoise Sci Nat, 37:547–579, 1901

Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura.Bull Soc Vaudoise Sci Nat, 37:547–579, 1901

work page 1901

[42] [42]

X86 instruction semantics and basic block sym- bolic execution.Archive of Formal Proofs, Octo- ber 2021

Freek Verbeek, Abhijith Bharadwaj, Joshua Bock- enek, Ian Roessle, Timmy Weerwag, and Binoy Ravin- dran. X86 instruction semantics and basic block sym- bolic execution.Archive of Formal Proofs, Octo- ber 2021. https://isa-afp.org/entries/X86_ Semantics.html, Formal proof development

work page 2021

[43] [43]

A formal model of ieee floating point arithmetic.Archive of Formal Proofs, July

Lei Yu. A formal model of ieee floating point arithmetic.Archive of Formal Proofs, July

work page

[44] [44]

https://isa-afp.org/entries/IEEE_ Floating_Point.html, Formal proof development

work page

[45] [45]

Formal verification of mod- ern sat solvers.Archive of Formal Proofs, July 2008

Filip Mari ´c. Formal verification of mod- ern sat solvers.Archive of Formal Proofs, July 2008. https://isa-afp.org/entries/ SATSolverVerification.html, Formal proof devel- opment

work page 2008

[46] [46]

The proba- bilistic relevance framework: Bm25 and beyond.Foun- dations and Trends® in Information Retrieval, 3(4):333– 389, 2009

Stephen Robertson, Hugo Zaragoza, et al. The proba- bilistic relevance framework: Bm25 and beyond.Foun- dations and Trends® in Information Retrieval, 3(4):333– 389, 2009

work page 2009

[47] [47]

{CertiKOS}: An extensible architecture for building certified concurrent {OS} kernels

Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan New- man Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. {CertiKOS}: An extensible architecture for building certified concurrent {OS} kernels. In12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 653–669, 2016

work page 2016

[48] [48]

Formally verified memory protection for a commodity multiprocessor hypervisor

Shih-Wei Li, Xupeng Li, Ronghui Gu, Jason Nieh, and John Zhuang Hui. Formally verified memory protection for a commodity multiprocessor hypervisor. In30th USENIX Security Symposium (USENIX Security 21), pages 3953–3970. USENIX Association, August 2021

work page 2021

[49] [49]

The design and implementation of vampire.AI communications, 15(2-3):91–110, 2002

Alexandre Riazanov and Andrei V oronkov. The design and implementation of vampire.AI communications, 15(2-3):91–110, 2002

work page 2002

[50] [50]

Faster, higher, stronger: E 2.3

Stephan Schulz, Simon Cruanes, and Petar Vukmirovi´c. Faster, higher, stronger: E 2.3. InInternational Confer- ence on Automated Deduction, pages 495–507. Springer, 2019

work page 2019

[51] [51]

Z3: An efficient smt solver

Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient smt solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008

work page 2008

[52] [52]

cvc5: A versatile and industrial-strength smt solver

Haniel Barbosa, Clark Barrett, Martin Brain, Gereon Kremer, Hanna Lachnitt, Makai Mann, Abdalrhman Mohamed, Mudathir Mohamed, Aina Niemetz, Andres Nötzli, et al. cvc5: A versatile and industrial-strength smt solver. InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 415–442. Springer, 2022

work page 2022

[53] [53]

Hammer for coq: Automation for dependent type theory.Journal of auto- mated reasoning, 61(1):423–453, 2018

Łukasz Czajka and Cezary Kaliszyk. Hammer for coq: Automation for dependent type theory.Journal of auto- mated reasoning, 61(1):423–453, 2018

work page 2018

[54] [54]

Ironclad apps: {End-to-End} security via auto- mated {Full-System} verification

Chris Hawblitzel, Jon Howell, Jacob R Lorch, Ar- jun Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill. Ironclad apps: {End-to-End} security via auto- mated {Full-System} verification. In11th USENIX symposium on operating systems design and implemen- tation (OSDI 14), pages 165–181, 2014

work page 2014

[55] [55]

Safe to the last instruc- tion: automated verification of a type-safe operating system.Communications of the ACM, 54(12):123–131, 2011

Jean Yang and Chris Hawblitzel. Safe to the last instruc- tion: automated verification of a type-safe operating system.Communications of the ACM, 54(12):123–131, 2011

work page 2011

[56] [56]

Dafny: An automatic program verifier for functional correctness

K Rustan M Leino. Dafny: An automatic program verifier for functional correctness. InInternational con- ference on logic for programming artificial intelligence and reasoning, pages 348–370. Springer, 2010

work page 2010

[57] [57]

Boogie: A mod- ular reusable verifier for object-oriented programs

Mike Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K Rustan M Leino. Boogie: A mod- ular reusable verifier for object-oriented programs. In International Symposium on Formal Methods for Com- ponents and Objects, pages 364–387. Springer, 2005

work page 2005

[58] [58]

Holophrasm: a neural Automated Theorem Prover for higher-order logic

Daniel Whalen. Holophrasm: a neural automated the- orem prover for higher-order logic.arXiv preprint arXiv:1608.02644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[59] [59]

Generating correctness proofs with neural networks

Alex Sanchez-Stern, Yousef Alhessi, Lawrence Saul, and Sorin Lerner. Generating correctness proofs with neural networks. InProceedings of the 4th ACM SIG- PLAN International Workshop on Machine Learning and Programming Languages, pages 1–10, 2020

work page 2020

[60] [60]

Diversity-driven automated formal verification

Emily First and Yuriy Brun. Diversity-driven automated formal verification. InProceedings of the 44th Inter- national Conference on Software Engineering, pages 749–761, 2022

work page 2022

[61] [61]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs

Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Timo- thee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

work page 2023

[62] [62]

Reviving dsp for advanced theorem proving in the era of reasoning models.arXiv preprint arXiv:2506.11487, 2025

Chenrui Cao, Liangcheng Song, Zenan Li, Xinyi Le, Xian Zhang, Hui Xue, and Fan Yang. Reviving dsp for advanced theorem proving in the era of reasoning models.arXiv preprint arXiv:2506.11487, 2025. 16

work page arXiv 2025

[63] [63]

Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

work page 2024

[64] [64]

Baldur: Whole-proof generation and repair with large language models

Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1229–1241, 2023

work page 2023

[65] [65]

Proof automation with large language models

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. Proof automation with large language models. InPro- ceedings of the 39th IEEE/ACM International Confer- ence on Automated Software Engineering, pages 1509– 1520, 2024

work page 2024

[66] [66]

Guided proof search using large language models and lemma extraction in coq

Tarun Prasad and Nada Amin. Guided proof search using large language models and lemma extraction in coq. InICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2024

work page 2025

[67] [67]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative lan- guage modeling for automated theorem proving.arXiv preprint arXiv:2009.03393, 2020

work page internal anchor Pith review arXiv 2009

[68] [68]

Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337– 26349, 2022

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337– 26349, 2022

work page 2022

[69] [69]

Thor: Wielding ham- mers to integrate language models and automated theo- rem provers.Advances in Neural Information Process- ing Systems, 35:8360–8373, 2022

Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygó´ zd´ z, Piotr Miło´s, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding ham- mers to integrate language models and automated theo- rem provers.Advances in Neural Information Process- ing Systems, 35:8360–8373, 2022

work page 2022

[70] [70]

Proofaug: Efficient neural theorem proving via fine-grained proof structure analysis.arXiv preprint arXiv:2501.18310, 2025

Haoxiong Liu, Jiacheng Sun, Zhenguo Li, and An- drew C Yao. Proofaug: Efficient neural theorem proving via fine-grained proof structure analysis.arXiv preprint arXiv:2501.18310, 2025

work page arXiv 2025

[71] [71]

Proof strategy extraction from llms for enhancing symbolic provers

Jian Fang, Yican Sun, and Yingfei Xiong. Proof strategy extraction from llms for enhancing symbolic provers. arXiv preprint arXiv:2510.10131, 2025. 17

work page arXiv 2025