Agentic Verification of Software Systems

Abhik Roychoudhury; Haoxin Tu; Huan Zhao; Mehtab Zafar; Ruijie Meng; Yahui Song

arxiv: 2511.17330 · v3 · submitted 2025-11-21 · 💻 cs.SE

Agentic Verification of Software Systems

Haoxin Tu , Huan Zhao , Yahui Song , Mehtab Zafar , Ruijie Meng , Abhik Roychoudhury This is my paper

Pith reviewed 2026-05-17 20:16 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM agentprogram verificationtheorem proverRocqautomated verificationsoftware correctnessAI-generated code

0 comments

The pith

An LLM agent can automatically verify software by iteratively refining proofs through collaboration with the Rocq theorem prover.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoRocq, an LLM-based proof agent that constructs program proofs without prior training on examples. It improves proofs on the fly by exchanging messages with the Rocq theorem prover, using the prover's feedback to adjust proof structure and search steps until a complete derivation is accepted. This setup creates an autonomous loop between agent and prover. If the method works for realistic code, it could close the gap between AI code generation and reliable verification, enabling trusted automatic programming for large systems.

Core claim

The agent learns on-the-fly and improves the proof via an iterative refinement loop achieved by the proof agent communicating with the Rocq theorem prover to get additional context and feedback. The final result of the iteration is a proof derivation checked by the Rocq theorem prover, involving autonomous collaboration between the proof agent and the theorem prover.

What carries the argument

The iterative refinement loop that enables the proof agent to communicate with the Rocq theorem prover for context and feedback, guiding proof tree construction and decision-making.

If this is right

The agent shows promising results on SV-COMP benchmarks for automated verification.
It extends to verification of Linux kernel modules.
The approach supports integration with AI coding agents to form a generate-and-validate loop.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If convergence holds for larger modules, the method could verify entire subsystems generated by AI coding tools.
The same agent-prover loop pattern might apply to other formal systems beyond Rocq.
Success would reduce the need for hand-crafted proof scripts in routine software correctness checks.

Load-bearing premise

Iterative communication between the LLM agent and the theorem prover will converge to a complete, accepted proof for non-trivial programs without excessive failures or human intervention.

What would settle it

Running the agent on the full SV-COMP benchmark suite and counting how many tasks produce a Rocq-accepted proof within a fixed number of refinement rounds without manual fixes.

Figures

Figures reproduced from arXiv: 2511.17330 by Abhik Roychoudhury, Haoxin Tu, Huan Zhao, Mehtab Zafar, Ruijie Meng, Yahui Song.

**Figure 1.** Figure 1: (a) Proof obligation wp_goal extracted from benchmark52_polynomial in SV-COMP [7] and its proof generated by AutoRocq, with (b) the proof tree constructed during proving, where lemmas highlighted are retrieved from the global context autonomously by AutoRocq. and none of them can prove this lemma during our experiments. We highlight these challenges to motivate AutoRocq’s agentic design in the following. V… view at source ↗

**Figure 2.** Figure 2: Overview of AutoRocq, where components involving decision-making by LLMs are highlighted [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Histogram of different complexity metrics: lemmas from SV-COMP programs (blue) vs. [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: [RQ.1] Lemmas proved by each tool on CoqGym and SV-COMP programs. On both benchmarks, AutoRocq is able to prove more lemmas, and has the most number of uniquely proved lemmas. 0-99 100-149 150-199 200-299 300+ 0 20 40 60 80 # of Proved Lemmas AutoRocq Rango PALM QEDC P9001 (a) Breakdown by term count. 0-9 10-14 15-19 20-29 30+ 0 20 40 60 80 # of Proved Lemmas AutoRocq Rango PALM QEDC P9001 (b) Breakdown by… view at source ↗

**Figure 5.** Figure 5: [RQ.1] # of lemmas proved from SV-COMP programs: a breakdown by lemmas’ complexity. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: [RQ.1] # of lemmas proved from SV-COMP programs: a breakdown by the category of lemmas. Search About Print Check Locate 10 1 10 2 10 3 10 4 Log Count [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 8.** Figure 8: [RQ.3] Proof for wp_goal in [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

read the original abstract

Automatically generated code is gaining traction recently, owing to the prevalence of Large Language Models (LLMs). Further, the AlphaProof initiative has demonstrated the possibility of using AI for general mathematical reasoning. Reasoning about computer programs (software) can be accomplished via general mathematical reasoning; however, it tends to be more structured and richer in contexts. This forms an attractive proposition, since then AI agents can be used to reason about voluminous code that gets generated by AI. In this work, we present a first LLM agent, AutoRocq, for conducting program verification. Unlike past works, which rely on extensive training of LLMs on proof examples, our agent learns on-the-fly and improves the proof via an iterative refinement loop. The iterative improvement of the proof is achieved by the proof agent communicating with the Rocq (formerly Coq) theorem prover to get additional context and feedback. The final result of the iteration is a proof derivation checked by the Rocq theorem prover. In this way, our proof construction involves autonomous collaboration between the proof agent and the theorem prover. This autonomy facilitates the search for proofs and decision-making in deciding on the structure of the proof tree. Experimental evaluation on SV-COMP benchmarks and on Linux kernel modules shows promising efficacy in achieving automated program verification. As automation in code generation becomes more widespread, we posit that our proof agent can be potentially integrated with AI coding agents to achieve a generate and validate loop, thus moving closer to the vision of trusted automatic programming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AutoRocq, an LLM-based proof agent for automated program verification. Unlike prior approaches that require extensive training on proof corpora, AutoRocq learns on-the-fly by maintaining an iterative refinement loop in which the agent queries the Rocq theorem prover for context and feedback, then updates the proof tree accordingly. The process terminates with a Rocq-checked derivation; the authors claim this yields autonomous collaboration between agent and prover. Experimental evaluation on SV-COMP benchmarks and Linux kernel modules is reported to demonstrate promising efficacy, with suggested future integration into AI code-generation pipelines to realize a generate-and-validate loop.

Significance. If the iterative agent-prover loop can be shown to converge reliably on non-trivial programs with limited human steering, the work would offer a practical bridge between LLM-driven code generation and formal verification. The on-the-fly, feedback-driven construction of proof trees without pre-training on proof examples is a distinguishing design choice that could scale to the volume of AI-generated code.

major comments (2)

[Abstract] Abstract: the statement that evaluation on SV-COMP benchmarks and Linux kernel modules 'shows promising efficacy' is unsupported by any quantitative data (success rates, iteration depths, failure modes, human-intervention counts, or baseline comparisons). Because the central claim of effective automated verification rests on these experiments, the absence of concrete results leaves the efficacy assertion weakly evidenced.
[Experimental Evaluation] Experimental Evaluation (or equivalent results section): the manuscript must report, for each benchmark suite, the fraction of tasks for which the iterative loop produced a complete Rocq-accepted proof, the distribution of iteration counts, the frequency and nature of non-convergence or timeouts, and any human guidance supplied. Without these metrics the claim that the loop 'reliably' produces verified derivations for non-trivial programs cannot be assessed.

minor comments (2)

Provide a precise description of the agent's state representation and the exact protocol by which Rocq feedback is translated into updates to the proof tree.
Clarify whether the agent is permitted to backtrack over earlier proof steps or is restricted to forward refinement only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need to strengthen the presentation of experimental results. We address each major comment below and will revise the manuscript accordingly to provide the requested quantitative details.

read point-by-point responses

Referee: [Abstract] Abstract: the statement that evaluation on SV-COMP benchmarks and Linux kernel modules 'shows promising efficacy' is unsupported by any quantitative data (success rates, iteration depths, failure modes, human-intervention counts, or baseline comparisons). Because the central claim of effective automated verification rests on these experiments, the absence of concrete results leaves the efficacy assertion weakly evidenced.

Authors: We agree that the abstract claim would be better supported by concrete metrics. The current abstract summarizes the overall outcome of the evaluation; the full Experimental Evaluation section provides case studies and qualitative observations from the SV-COMP and Linux kernel benchmarks. In the revision we will either temper the abstract phrasing or add a concise summary of key quantitative indicators (e.g., overall success fraction and average iteration depth) so that the abstract is consistent with the detailed results that will be expanded. revision: yes
Referee: [Experimental Evaluation] Experimental Evaluation (or equivalent results section): the manuscript must report, for each benchmark suite, the fraction of tasks for which the iterative loop produced a complete Rocq-accepted proof, the distribution of iteration counts, the frequency and nature of non-convergence or timeouts, and any human guidance supplied. Without these metrics the claim that the loop 'reliably' produces verified derivations for non-trivial programs cannot be assessed.

Authors: We accept this recommendation. The present Experimental Evaluation section emphasizes the agent-prover interaction protocol and selected verification examples rather than aggregated statistics. We will expand the section with tables reporting, for each suite: (i) the fraction of tasks that reached a Rocq-accepted proof, (ii) summary statistics or distributions of iteration counts, (iii) counts and categorization of non-convergent or timed-out cases, and (iv) any human guidance that was provided. These additions will allow readers to evaluate the reliability of the iterative loop directly. revision: yes

Circularity Check

0 steps flagged

No circularity: AutoRocq uses external Rocq feedback for independent proof checking

full rationale

The paper describes a practical agentic system in which an LLM proof agent iteratively refines proof scripts by querying the external Rocq theorem prover for context, error messages, and acceptance. The final output is a Rocq-verified derivation; this external oracle supplies the acceptance criterion rather than any quantity defined inside the paper. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the abstract or claims. The experimental results on SV-COMP and Linux kernel modules are presented as empirical outcomes of the loop, not as quantities that reduce to the method's own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The contribution is an engineering system rather than a formal derivation; no free parameters, mathematical axioms, or invented physical entities are introduced beyond the agent itself.

invented entities (1)

AutoRocq proof agent no independent evidence
purpose: Autonomous collaboration with Rocq for on-the-fly proof construction
The agent is the core new artifact; no independent falsifiable evidence outside the described experiments is provided.

pith-pipeline@v0.9.0 · 5578 in / 1083 out tokens · 61107 ms · 2026-05-17T20:16:29.651660+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The iterative improvement of the proof is achieved by the proof agent communicating with the Rocq (formerly Coq) theorem prover to get additional context and feedback. The final result of the iteration is a proof derivation checked by the Rocq theorem prover.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat.induction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

AutoRocq explicitly maintains a (partial) proof tree as it progresses in tactic generation... the proof tree is updated by extending the current node.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Certified Program Synthesis with a Multi-Modal Verifier
cs.SE 2026-04 unverdicted novelty 7.0

LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, out...

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 1 Pith paper

[1]

[n. d.]. BusyBox. https://busybox.net/

work page
[2]

[n. d.]. Evolved Value Analysis (Eva) Plug-in in Frama-C. https://frama-c.com/fc-plugins/eva.html

work page
[3]

[n. d.]. Runtime Error (RTE) Plug-in in Frama-C. https://www.frama-c.com/fc-plugins/rte.html

work page
[4]

[n. d.]. Weakest Precondition (WP) Plug-in in Frama-C. https://www.frama-c.com/fc-plugins/wp.html

work page
[5]

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics

work page 2021
[6]

Haniel Barbosa, Clark Barrett, Martin Brain, Gereon Kremer, Hanna Lachnitt, Makai Mann, Abdalrhman Mohamed, Mudathir Mohamed, Aina Niemetz, Andres Nötzli, et al. 2022. cvc5: A versatile and industrial-strength SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 415–442

work page 2022
[7]

Dirk Beyer and Jan Strejček. 2025. Improvements in Software Verification and Witness Validation: SV-COMP 2025. In Tools and Algorithms for the Construction and Analysis of Systems, Arie Gurfinkel and Marijn Heule (Eds.). Springer Nature Switzerland, Cham, 151–186

work page 2025
[8]

Lasse Blaauwbroek, Josef Urban, and Herman Geuvers. 2020. The tactician: A seamless, interactive tactic learner and prover for coq. InInternational Conference on Intelligent Computer Mathematics. Springer, 271–277

work page 2020
[9]

Pedro Carrott, Nuno Saavedra, Kyle Thompson, Sorin Lerner, João F Ferreira, and Emily First. 2024. CoqPyt: Proof Navigation in Python in the Era of LLMs. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 637–641

work page 2024
[10]

Frans Kaashoek, and Nickolai Zeldovich

Tej Chajed, Joseph Tassarotti, Mark Theng, M. Frans Kaashoek, and Nickolai Zeldovich. 2022. Verifying the DaisyNFS concurrent and crash-safe file system with sequential reasoning. In16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, Marcos K. Aguilera and Hakim Weatherspoon (Eds.). USENIX ...

work page 2022
[11]

Projet Coq. 1996. The coq proof assistant-reference manual.INRIA Rocquencourt and ENS Lyon, version5 (1996), 7–1

work page 1996
[12]

Pascal Cuoq, Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2012. Frama-C: A software analysis perspective. InInternational Conference on Software Engineering and Formal Methods (SEFM). Springer, 233–247

work page 2012
[13]

Łukasz Czajka and Cezary Kaliszyk. 2018. Hammer for Coq: Automation for dependent type theory.Journal of automated reasoning61, 1 (2018), 423–453

work page 2018
[14]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340

work page 2008
[15]

Leonardo de Moura and Sebastian Ullrich. 2021. The Lean 4 Theorem Prover and Programming Language. In Automated Deduction - CADE 28 - 28th International Conference on Automated Deduction, Virtual Event, July 12-15, 2021, Proceedings (Lecture Notes in Computer Science, Vol. 12699), André Platzer and Geoff Sutcliffe (Eds.). Springer, 625–635. doi:10.1007/97...

work page doi:10.1007/978-3-030-79876-5_37 2021
[16]

Leonardo Mendonça de Moura and Nikolaj S. Bjørner. 2008. Z3: An Efficient SMT Solver. InTools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008. Proceedings (Lecture ...

work page doi:10.1007/978-3-540-78800-3_24 2008
[17]

Google DeepMind. [n. d.]. AlphaProof. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver- medal-level/

work page
[18]

Denis Efremov, Mikhail Mandrykin, and Alexey Khoroshilov. 2018. Deductive verification of unmodified Linux kernel library functions. InInternational Symposium on Leveraging Applications of Formal Methods. Springer, 216–234

work page 2018
[19]

Emily First, Yuriy Brun, and Arjun Guha. 2020. TacTok: Semantics-aware proof synthesis.Proceedings of the ACM on Programming Languages4, OOPSLA (2020), 1–31

work page 2020
[20]

Madhusudan, and Dan Roth

Pranav Garg, Daniel Neider, P. Madhusudan, and Dan Roth. 2016. Learning invariants using decision trees and implication counterexamples. InProceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages(St. Petersburg, FL, USA)(POPL ’16). Association for Computing Machinery, New York, NY, USA, 499–512. doi:10.1145/283761...

work page doi:10.1145/2837614.2837664 2016
[21]

Yeting Ge and Leonardo De Moura. 2009. Complete instantiation for quantified formulas in satisfiabiliby modulo theories. InComputer Aided Verification: 21st International Conference, CA V 2009, Grenoble, France, June 26-July 2, 2009. Proceedings 21. Springer, 306–320

work page 2009
[22]

2025.Proof tree visualization for Proof General

Hendrik. 2025.Proof tree visualization for Proof General. https://askra.de/software/prooftree/

work page 2025
[23]

Jones, John W

Robert B. Jones, John W. O’Leary, Carl-Johan H. Seger, Mark D. Aagaard, and Thomas F. Melham. 2001. Practical Formal Verification in Microprocessor Design.IEEE Des. Test Comput.18, 4 (2001), 16–25. doi:10.1109/54.936245

work page doi:10.1109/54.936245 2001
[24]

Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2015. Frama-C: A software analysis perspective.Formal Aspects Comput.27, 3 (2015), 573–609. doi:10.1007/S00165-014-0326-7

work page doi:10.1007/s00165-014-0326-7 2015
[25]

seL4: formal verification of an OS kernel,

Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David A. Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. 2009. seL4: formal verification of an OS kernel.. InProceedings of the 22nd ACM Symposium on Operating Systems Principles 2009, SOSP 2009, Big Sky, ...

work page doi:10.1145/1629575.1629596 2009
[26]

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. 2022. Hypertree proof search for neural theorem proving.Advances in neural information processing systems35 (2022), 26337–26349

work page 2022
[27]

Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. 2023. Verus: Verifying Rust Programs using Linear Ghost Types (extended version).CoRR abs/2303.05491 (2023). arXiv:2303.05491 doi:10.48550/ARXIV.2303.05491

work page doi:10.48550/arxiv.2303.05491 2023
[28]

In: Clarke, E.M., Voronkov, A

K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InLogic for Programming, Artificial Intelligence, and Reasoning - 16th International Conference, LPAR-16, Dakar, Senegal, April 25-May 1, 2010, Revised Selected Papers (Lecture Notes in Computer Science, Vol. 6355), Edmund M. Clarke and Andrei Voronkov (Eds.). Sprin...

work page doi:10.1007/978-3-642-17511-4_20 2010
[29]

Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bernhard Schommer, Markus Pister, and Christian Ferdinand. 2016. CompCert-a formally verified optimizing compiler. InERTS 2016: Embedded Real Time Software and Systems, 8th European Congress

work page 2016
[30]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

work page 2020
[31]

Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si. [n. d.]. A Survey on Deep Learning for Theorem Proving. InFirst Conference on Language Modeling

work page
[32]

Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. 2025. DafnyBench: A Benchmark for Formal Software Verification.Trans. Mach. Learn. Res.2025 (2025). https://openreview.net/forum?id=yBgTVWccIx

work page 2025
[34]

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof automation with large language models. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1509–1520

work page 2024
[35]

Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. 2025. SpecGen: Automated Generation of Formal Program Specifications via Large Language Models . In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 16–28. doi:10.1109/ICSE55347.2025.00129 , Vol. 1, No. 1, Article . Publication ...

work page doi:10.1109/icse55347.2025.00129 2025
[36]

Gregory Malecha, Greg Morrisett, Avraham Shinnar, and Ryan Wisnesky. 2010. Toward a verified relational database management system. InProceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages(Madrid, Spain)(POPL ’10). Association for Computing Machinery, New York, NY, USA, 237–248. doi:10. 1145/1706299.1706329

work page arXiv 2010
[37]

Lopes, Iris Ma, and James Noble

Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, and James Noble. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods.Proc. ACM Softw. Eng.1, FSE (2024), 812–835. doi:10.1145/3643763

work page doi:10.1145/3643763 2024
[38]

Peter Müller, Malte Schwerhoff, and Alexander J. Summers. 2016. Viper: A Verification Infrastructure for Permission- Based Reasoning. InVerification, Model Checking, and Abstract Interpretation - 17th International Conference, VMCAI 2016, St. Petersburg, FL, USA, January 17-19, 2016. Proceedings (Lecture Notes in Computer Science, Vol. 9583), Barbara Jobs...

work page doi:10.1007/978-3-662-49122-5_2 2016
[39]

1994.Isabelle: A generic theorem prover

Lawrence C Paulson. 1994.Isabelle: A generic theorem prover. Springer

work page 1994
[40]

Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2025. Asleep at the keyboard? assessing the security of github copilot’s code contributions.Commun. ACM68, 2 (2025), 96–105

work page 2025
[41]

Alex Sanchez-Stern, Yousef Alhessi, Lawrence Saul, and Sorin Lerner. 2020. Generating correctness proofs with neural networks. InProceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 1–10

work page 2020
[42]

Alex Sanchez-Stern, Emily First, Timothy Zhou, Zhanna Kaufman, Yuriy Brun, and Talia Ringer. 2023. Passport: Improving automated formal verification using identifiers.ACM Transactions on Programming Languages and Systems 45, 2 (2023), 1–30

work page 2023
[43]

Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, and Yuriy Brun. 2024. QED- Cartographer: Automating formal verification using reward-free reinforcement learning. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 405–418

work page 2024
[44]

Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, and Le Song. 2018. Learning loop invariants for program verification.Advances in Neural Information Processing Systems31 (2018)

work page 2018
[45]

Xujie Si, Aaditya Naik, Hanjun Dai, Mayur Naik, and Le Song. 2020. Code2Inv: A Deep Learning Framework for Program Verification. InComputer Aided Verification - 32nd International Conference, CA V 2020, Los Angeles, CA, USA, July 21-24, 2020, Proceedings, Part II (Lecture Notes in Computer Science, Vol. 12225), Shuvendu K. Lahiri and Chao Wang (Eds.). Spr...

work page doi:10.1007/978-3-030-53291-8_9 2020
[46]

Matthieu Sozeau, Yannick Forster, Meven Lennon-Bertrand, Jakob Nielsen, Nicolas Tabareau, and Théo Winterhalter

work page
[47]

Correct and Complete Type Checking and Certified Erasure for Coq, in Coq , journal =

Correct and Complete Type Checking and Certified Erasure for Coq, in Coq.J. ACM72, 1, Article 8 (Jan. 2025), 74 pages. doi:10.1145/3706056

work page doi:10.1145/3706056 2025
[48]

2025.Search Comand in Rocq Documentation

Rocq Development Team. 2025.Search Comand in Rocq Documentation. https://rocq-prover.org/doc/V9.0.0/refman/ proof-engine/vernacular-commands.html#coq:cmd.Search

work page 2025
[49]

2025.IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, Joao F. Ferreira, Sorin Lerner, and Emily First. 2025. Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification . In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 347–...

work page doi:10.1109/icse55347.2025.00161 2025
[50]

Grigoriy Volkov, Mikhail Mandrykin, and Denis Efremov. 2018. Lemma functions for Frama-C: C programs as proofs. In2018 Ivannikov Ispras Open Conference (ISPRAS). IEEE, 31–38

work page 2018
[51]

Minchao Wu, Michael Norrish, Christian Walder, and Amir Dezfouli. 2021. TacticZero: Learning to Prove Theorems from Scratch with Deep Reinforcement Learning. InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc’Aurelio Ranzato, Alina Beygel...

work page 2021
[52]

Kaiyu Yang and Jia Deng. 2019. Learning to prove theorems via interacting with proof assistants. InInternational Conference on Machine Learning. PMLR, 6984–6994

work page 2019
[53]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604

work page 2024
[54]

Naiqian Zheng, Mengqi Liu, Yuxing Xiang, Linjian Song, Dong Li, Feng Han, Nan Wang, Yong Ma, Zhuo Liang, Dennis Cai, Ennan Zhai, Xuanzhe Liu, and Xin Jin. 2023. Automated Verification of an In-Production DNS Authoritative Engine. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023, Jason Fl...

work page doi:10.1145/3600006.3613153 2023

[1] [1]

[n. d.]. BusyBox. https://busybox.net/

work page

[2] [2]

[n. d.]. Evolved Value Analysis (Eva) Plug-in in Frama-C. https://frama-c.com/fc-plugins/eva.html

work page

[3] [3]

[n. d.]. Runtime Error (RTE) Plug-in in Frama-C. https://www.frama-c.com/fc-plugins/rte.html

work page

[4] [4]

[n. d.]. Weakest Precondition (WP) Plug-in in Frama-C. https://www.frama-c.com/fc-plugins/wp.html

work page

[5] [5]

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics

work page 2021

[6] [6]

Haniel Barbosa, Clark Barrett, Martin Brain, Gereon Kremer, Hanna Lachnitt, Makai Mann, Abdalrhman Mohamed, Mudathir Mohamed, Aina Niemetz, Andres Nötzli, et al. 2022. cvc5: A versatile and industrial-strength SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 415–442

work page 2022

[7] [7]

Dirk Beyer and Jan Strejček. 2025. Improvements in Software Verification and Witness Validation: SV-COMP 2025. In Tools and Algorithms for the Construction and Analysis of Systems, Arie Gurfinkel and Marijn Heule (Eds.). Springer Nature Switzerland, Cham, 151–186

work page 2025

[8] [8]

Lasse Blaauwbroek, Josef Urban, and Herman Geuvers. 2020. The tactician: A seamless, interactive tactic learner and prover for coq. InInternational Conference on Intelligent Computer Mathematics. Springer, 271–277

work page 2020

[9] [9]

Pedro Carrott, Nuno Saavedra, Kyle Thompson, Sorin Lerner, João F Ferreira, and Emily First. 2024. CoqPyt: Proof Navigation in Python in the Era of LLMs. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 637–641

work page 2024

[10] [10]

Frans Kaashoek, and Nickolai Zeldovich

Tej Chajed, Joseph Tassarotti, Mark Theng, M. Frans Kaashoek, and Nickolai Zeldovich. 2022. Verifying the DaisyNFS concurrent and crash-safe file system with sequential reasoning. In16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, Marcos K. Aguilera and Hakim Weatherspoon (Eds.). USENIX ...

work page 2022

[11] [11]

Projet Coq. 1996. The coq proof assistant-reference manual.INRIA Rocquencourt and ENS Lyon, version5 (1996), 7–1

work page 1996

[12] [12]

Pascal Cuoq, Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2012. Frama-C: A software analysis perspective. InInternational Conference on Software Engineering and Formal Methods (SEFM). Springer, 233–247

work page 2012

[13] [13]

Łukasz Czajka and Cezary Kaliszyk. 2018. Hammer for Coq: Automation for dependent type theory.Journal of automated reasoning61, 1 (2018), 423–453

work page 2018

[14] [14]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340

work page 2008

[15] [15]

Leonardo de Moura and Sebastian Ullrich. 2021. The Lean 4 Theorem Prover and Programming Language. In Automated Deduction - CADE 28 - 28th International Conference on Automated Deduction, Virtual Event, July 12-15, 2021, Proceedings (Lecture Notes in Computer Science, Vol. 12699), André Platzer and Geoff Sutcliffe (Eds.). Springer, 625–635. doi:10.1007/97...

work page doi:10.1007/978-3-030-79876-5_37 2021

[16] [16]

Leonardo Mendonça de Moura and Nikolaj S. Bjørner. 2008. Z3: An Efficient SMT Solver. InTools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008. Proceedings (Lecture ...

work page doi:10.1007/978-3-540-78800-3_24 2008

[17] [17]

Google DeepMind. [n. d.]. AlphaProof. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver- medal-level/

work page

[18] [18]

Denis Efremov, Mikhail Mandrykin, and Alexey Khoroshilov. 2018. Deductive verification of unmodified Linux kernel library functions. InInternational Symposium on Leveraging Applications of Formal Methods. Springer, 216–234

work page 2018

[19] [19]

Emily First, Yuriy Brun, and Arjun Guha. 2020. TacTok: Semantics-aware proof synthesis.Proceedings of the ACM on Programming Languages4, OOPSLA (2020), 1–31

work page 2020

[20] [20]

Madhusudan, and Dan Roth

Pranav Garg, Daniel Neider, P. Madhusudan, and Dan Roth. 2016. Learning invariants using decision trees and implication counterexamples. InProceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages(St. Petersburg, FL, USA)(POPL ’16). Association for Computing Machinery, New York, NY, USA, 499–512. doi:10.1145/283761...

work page doi:10.1145/2837614.2837664 2016

[21] [21]

Yeting Ge and Leonardo De Moura. 2009. Complete instantiation for quantified formulas in satisfiabiliby modulo theories. InComputer Aided Verification: 21st International Conference, CA V 2009, Grenoble, France, June 26-July 2, 2009. Proceedings 21. Springer, 306–320

work page 2009

[22] [22]

2025.Proof tree visualization for Proof General

Hendrik. 2025.Proof tree visualization for Proof General. https://askra.de/software/prooftree/

work page 2025

[23] [23]

Jones, John W

Robert B. Jones, John W. O’Leary, Carl-Johan H. Seger, Mark D. Aagaard, and Thomas F. Melham. 2001. Practical Formal Verification in Microprocessor Design.IEEE Des. Test Comput.18, 4 (2001), 16–25. doi:10.1109/54.936245

work page doi:10.1109/54.936245 2001

[24] [24]

Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2015. Frama-C: A software analysis perspective.Formal Aspects Comput.27, 3 (2015), 573–609. doi:10.1007/S00165-014-0326-7

work page doi:10.1007/s00165-014-0326-7 2015

[25] [25]

seL4: formal verification of an OS kernel,

Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David A. Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. 2009. seL4: formal verification of an OS kernel.. InProceedings of the 22nd ACM Symposium on Operating Systems Principles 2009, SOSP 2009, Big Sky, ...

work page doi:10.1145/1629575.1629596 2009

[26] [26]

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. 2022. Hypertree proof search for neural theorem proving.Advances in neural information processing systems35 (2022), 26337–26349

work page 2022

[27] [27]

Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. 2023. Verus: Verifying Rust Programs using Linear Ghost Types (extended version).CoRR abs/2303.05491 (2023). arXiv:2303.05491 doi:10.48550/ARXIV.2303.05491

work page doi:10.48550/arxiv.2303.05491 2023

[28] [28]

In: Clarke, E.M., Voronkov, A

K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InLogic for Programming, Artificial Intelligence, and Reasoning - 16th International Conference, LPAR-16, Dakar, Senegal, April 25-May 1, 2010, Revised Selected Papers (Lecture Notes in Computer Science, Vol. 6355), Edmund M. Clarke and Andrei Voronkov (Eds.). Sprin...

work page doi:10.1007/978-3-642-17511-4_20 2010

[29] [29]

Xavier Leroy, Sandrine Blazy, Daniel Kästner, Bernhard Schommer, Markus Pister, and Christian Ferdinand. 2016. CompCert-a formally verified optimizing compiler. InERTS 2016: Embedded Real Time Software and Systems, 8th European Congress

work page 2016

[30] [30]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

work page 2020

[31] [31]

Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si. [n. d.]. A Survey on Deep Learning for Theorem Proving. InFirst Conference on Language Modeling

work page

[32] [32]

Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. 2025. DafnyBench: A Benchmark for Formal Software Verification.Trans. Mach. Learn. Res.2025 (2025). https://openreview.net/forum?id=yBgTVWccIx

work page 2025

[33] [34]

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof automation with large language models. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1509–1520

work page 2024

[34] [35]

Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. 2025. SpecGen: Automated Generation of Formal Program Specifications via Large Language Models . In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 16–28. doi:10.1109/ICSE55347.2025.00129 , Vol. 1, No. 1, Article . Publication ...

work page doi:10.1109/icse55347.2025.00129 2025

[35] [36]

Gregory Malecha, Greg Morrisett, Avraham Shinnar, and Ryan Wisnesky. 2010. Toward a verified relational database management system. InProceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages(Madrid, Spain)(POPL ’10). Association for Computing Machinery, New York, NY, USA, 237–248. doi:10. 1145/1706299.1706329

work page arXiv 2010

[36] [37]

Lopes, Iris Ma, and James Noble

Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, and James Noble. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods.Proc. ACM Softw. Eng.1, FSE (2024), 812–835. doi:10.1145/3643763

work page doi:10.1145/3643763 2024

[37] [38]

Peter Müller, Malte Schwerhoff, and Alexander J. Summers. 2016. Viper: A Verification Infrastructure for Permission- Based Reasoning. InVerification, Model Checking, and Abstract Interpretation - 17th International Conference, VMCAI 2016, St. Petersburg, FL, USA, January 17-19, 2016. Proceedings (Lecture Notes in Computer Science, Vol. 9583), Barbara Jobs...

work page doi:10.1007/978-3-662-49122-5_2 2016

[38] [39]

1994.Isabelle: A generic theorem prover

Lawrence C Paulson. 1994.Isabelle: A generic theorem prover. Springer

work page 1994

[39] [40]

Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2025. Asleep at the keyboard? assessing the security of github copilot’s code contributions.Commun. ACM68, 2 (2025), 96–105

work page 2025

[40] [41]

Alex Sanchez-Stern, Yousef Alhessi, Lawrence Saul, and Sorin Lerner. 2020. Generating correctness proofs with neural networks. InProceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 1–10

work page 2020

[41] [42]

Alex Sanchez-Stern, Emily First, Timothy Zhou, Zhanna Kaufman, Yuriy Brun, and Talia Ringer. 2023. Passport: Improving automated formal verification using identifiers.ACM Transactions on Programming Languages and Systems 45, 2 (2023), 1–30

work page 2023

[42] [43]

Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, and Yuriy Brun. 2024. QED- Cartographer: Automating formal verification using reward-free reinforcement learning. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 405–418

work page 2024

[43] [44]

Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, and Le Song. 2018. Learning loop invariants for program verification.Advances in Neural Information Processing Systems31 (2018)

work page 2018

[44] [45]

Xujie Si, Aaditya Naik, Hanjun Dai, Mayur Naik, and Le Song. 2020. Code2Inv: A Deep Learning Framework for Program Verification. InComputer Aided Verification - 32nd International Conference, CA V 2020, Los Angeles, CA, USA, July 21-24, 2020, Proceedings, Part II (Lecture Notes in Computer Science, Vol. 12225), Shuvendu K. Lahiri and Chao Wang (Eds.). Spr...

work page doi:10.1007/978-3-030-53291-8_9 2020

[45] [46]

Matthieu Sozeau, Yannick Forster, Meven Lennon-Bertrand, Jakob Nielsen, Nicolas Tabareau, and Théo Winterhalter

work page

[46] [47]

Correct and Complete Type Checking and Certified Erasure for Coq, in Coq , journal =

Correct and Complete Type Checking and Certified Erasure for Coq, in Coq.J. ACM72, 1, Article 8 (Jan. 2025), 74 pages. doi:10.1145/3706056

work page doi:10.1145/3706056 2025

[47] [48]

2025.Search Comand in Rocq Documentation

Rocq Development Team. 2025.Search Comand in Rocq Documentation. https://rocq-prover.org/doc/V9.0.0/refman/ proof-engine/vernacular-commands.html#coq:cmd.Search

work page 2025

[48] [49]

2025.IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, Joao F. Ferreira, Sorin Lerner, and Emily First. 2025. Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification . In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 347–...

work page doi:10.1109/icse55347.2025.00161 2025

[49] [50]

Grigoriy Volkov, Mikhail Mandrykin, and Denis Efremov. 2018. Lemma functions for Frama-C: C programs as proofs. In2018 Ivannikov Ispras Open Conference (ISPRAS). IEEE, 31–38

work page 2018

[50] [51]

Minchao Wu, Michael Norrish, Christian Walder, and Amir Dezfouli. 2021. TacticZero: Learning to Prove Theorems from Scratch with Deep Reinforcement Learning. InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc’Aurelio Ranzato, Alina Beygel...

work page 2021

[51] [52]

Kaiyu Yang and Jia Deng. 2019. Learning to prove theorems via interacting with proof assistants. InInternational Conference on Machine Learning. PMLR, 6984–6994

work page 2019

[52] [53]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604

work page 2024

[53] [54]

Naiqian Zheng, Mengqi Liu, Yuxing Xiang, Linjian Song, Dong Li, Feng Han, Nan Wang, Yong Ma, Zhuo Liang, Dennis Cai, Ennan Zhai, Xuanzhe Liu, and Xin Jin. 2023. Automated Verification of an In-Production DNS Authoritative Engine. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023, Jason Fl...

work page doi:10.1145/3600006.3613153 2023