pith. sign in

arxiv: 2605.26017 · v1 · pith:GQR4MCM4new · submitted 2026-05-25 · 💻 cs.SE

Trustworthy Software Project Generation : a Case Study with an Interactive Theorem Prover

Pith reviewed 2026-06-29 20:15 UTC · model grok-4.3

classification 💻 cs.SE
keywords verified code generationinteractive theorem proverRocqRISC-V interpreterLLM agentproof repairsoftware project synthesisCPU interpreter
0
0 comments X

The pith

LLM agent uses Rocq to autonomously generate a full verified RISC-V RV32I interpreter project with no human intervention after requirements are given.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether interactive theorem provers enable reliable generation of entire complex software projects by LLMs, in contrast to smaller tasks where errors often persist. It tests this through automatic creation of a CPU interpreter covering all 47 instructions of the RISC-V RV32I base. The agent implements effects in the target language, proves pure logic in the ITP, extracts the result, and integrates it into a complete project. Success occurs with Rocq but not Dafny, because Rocq supplies concrete proof states that allow the agent to repair failures autonomously. This route matters if it can reduce the risk of subtle errors in generated code for larger systems.

Core claim

With Rocq as the backend, the agent completes the project within 30 minutes, producing 1,859 lines of verified Rocq and extracting 2,848 lines of C++. The resulting interpreter passes all 265 LLM-generated tests covering the 47 instructions and exhibits zero crashes and zero hangs during 12 hours of AFL++ fuzzing. Under the same configuration, a Dafny-based backend fails to complete verification. The observation is that Rocq exposes a concrete proof state when a proof attempt fails, giving the agent actionable feedback for repair.

What carries the argument

Separation of effectful code (implemented directly in the target language) from pure total functions (proved in the ITP and extracted for integration), together with Rocq's exposure of concrete proof state for autonomous LLM-driven repair.

If this is right

  • Full instruction-set interpreters can be generated and verified automatically in under an hour.
  • ITP backends outperform Dafny-style systems for LLM synthesis because they supply explicit proof states for repair.
  • The generated project meets both functional tests and long-running fuzzing with no observed failures.
  • Synthesis, proof repair, extraction, and integration can all proceed without intervention once requirements are supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of effects and pure logic could be applied to other mixed systems such as device drivers or protocol stacks.
  • Agents might scale to larger verified components if the proof-state feedback loop is retained across multiple extraction targets.
  • The results point toward testing whether the approach succeeds on other base architectures beyond RV32I.

Load-bearing premise

The LLM agent can autonomously interpret Rocq's concrete proof state to repair failed proofs without human intervention, and separating effectful code from pure total functions suffices for a working integrated project.

What would settle it

A repeated trial in which the agent receives the same RISC-V RV32I requirements, runs under identical conditions, yet either fails to finish verification or produces extracted code that fails the test suite or shows crashes or hangs in fuzzing.

Figures

Figures reproduced from arXiv: 2605.26017 by Jian Fang, Yingfei Xiong.

Figure 1
Figure 1. Figure 1: CPU emulators of this kind are widely used in simulation and firmware development [19], [20]. Consider the require￾ment in the [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 1
Figure 1. Figure 1: The SPDDWL workflow instantiated with Rocq. Instruction encoding (bit fields): | 31-20 | 19-15 | 14-12 | 11-7 | 6-0 | | offset[11:0] | rs1 | 000 | rd | 1100111 | Format: jalr rd, rs1, offset Description: Jump to address and place return address in rd . Implementation: t = pc+4; pc = (x[rs1] + sext(offset)) & ˜1; x[rd] = t [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The requirement for the JALR instruction. Function-level specifications: jalr_link rd receives low_bits XLEN (pc + 4) as the return address. jalr_pc The new pc equals (x[rs1] + sext(offset)) with bit 0 cleared. jalr_lsb_zero Bit 0 of the new pc is always zero after exec_JALR. Global specification: jalr_frame [frame] Memory, CSRs, halt flag, privilege mode, and trace are unchanged. Every register other than… view at source ↗
Figure 3
Figure 3. Figure 3: The coding plan’s logic specifications for [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The exec_JALR implementation and the logic specifications pro￾duced by the agent. Definition step (s : state) (i : instr) : state := if halt s then s else match i with | LUI rd imm ⇒ pc_advance4 (exec_LUI s rd imm) | AUIPC rd imm ⇒ pc_advance4 (exec_AUIPC s rd imm) | ADDI rd rs1 imm ⇒ pc_advance4 (exec_ADDI s rd rs1 imm) ... (* The register file has exactly 32 entries. *) Lemma global_reg_length : forall s… view at source ↗
Figure 5
Figure 5. Figure 5: The step function and global specifications theorems for step. agent and a proving agent that generate and repair proof￾assistant codes, and an extraction step that translates accepted functional definitions into the target language. The workflow is defined for an ITP backend; this section describes workflow used in SPDDWL. A. Requirement analyzer The requirement analyzer is an LLM agent that takes the nat… view at source ↗
read the original abstract

Generating code from natural-language requirements has become a primary route for LLM-assisted software development. Although LLMs can successfully complete small programming tasks, generating an entire complex project remains unreliable because subtle errors may survive compilation and testing. Verified programming can reduce this risk by requiring generated implementations to satisfy machine-checked specifications. Existing explorations mostly target verification-oriented languages and toolchains such as Dafny and Frama-C, but directly generating project-scale verified code with these systems remains difficult. This paper studies whether interactive theorem provers (ITPs) can support large-scale software generation. Because ITPs handle pure total functions but not effects such as I/O, our agent separates effectful code from pure logic: it implements the effects in the target language, proves the pure logic in an ITP, and extracts it for integration into the final project. We study this route through the fully automatic development of a CPU interpreter for all 47 instructions of the unprivileged RISC-V RV32I base: after the requirements are supplied, no human intervenes in synthesis, proof repair, extraction, or integration. With Rocq as the backend, the agent completes the project within 30 minutes, producing 1,859 lines of verified Rocq and extracting 2,848 lines of C++. The resulting interpreter passes all 265 LLM-generated tests covering the 47 instructions and exhibits zero crashes and zero hangs during 12 hours of AFL++ fuzzing. Under the same configuration, a Dafny-based backend fails to complete verification. Our observation is that Rocq exposes a concrete proof state when a proof attempt fails, giving the agent actionable feedback for repair. These results provide empirical evidence that ITP-based verified programming is a feasible route for LLM-generated software projects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that interactive theorem provers (ITPs) such as Rocq enable fully automatic, human-intervention-free generation of complex verified software projects from natural-language requirements via an LLM agent. The agent separates effectful code (implemented in the target language) from pure total functions (proven in the ITP and extracted), demonstrated on a complete RISC-V RV32I interpreter covering all 47 instructions. With Rocq, the project completes in 30 minutes yielding 1,859 lines of verified Rocq code extracted to 2,848 lines of C++; the interpreter passes all 265 LLM-generated tests and shows zero crashes or hangs in 12 hours of AFL++ fuzzing. The Dafny backend fails under identical conditions. The central observation is that Rocq supplies concrete proof states that enable autonomous repair.

Significance. If the empirical outcomes hold, the work supplies concrete evidence that ITP-backed verified programming can overcome the unreliability of direct LLM code generation for project-scale artifacts. Strengths include machine-checked proofs in Rocq, extraction to executable C++, coverage by 265 tests, and rigorous fuzzing validation. This offers a reproducible route to trustworthy software and highlights an advantage of ITPs over verification-oriented languages like Dafny for LLM-driven synthesis.

major comments (2)
  1. [Agent–Rocq interaction / proof repair] The manuscript asserts that the LLM agent autonomously interprets Rocq's concrete proof state to repair failed proofs for all 47 instructions without human edits, yet provides no description of state serialization, the repair prompt template, or the number of repair attempts required per lemma. This mechanism is load-bearing for the central claim of fully automatic project generation (see abstract and the section on agent–Rocq interaction).
  2. [Integration and extraction] The claim that separating effectful C++ scaffolding from extracted pure total functions suffices for a working stateful interpreter is asserted but not supported by any explicit description of how state is maintained across instructions or how the integration was validated beyond the reported tests and fuzzing. This is central to the feasibility conclusion.
minor comments (1)
  1. [Title] The title contains an extraneous space before the colon; standard formatting would be 'Trustworthy Software Project Generation: a Case Study with an Interactive Theorem Prover'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments correctly identify areas where the manuscript would benefit from greater technical detail to substantiate the claims of fully automatic generation and integration. We respond to each point below and will incorporate the requested clarifications in the revision.

read point-by-point responses
  1. Referee: [Agent–Rocq interaction / proof repair] The manuscript asserts that the LLM agent autonomously interprets Rocq's concrete proof state to repair failed proofs for all 47 instructions without human edits, yet provides no description of state serialization, the repair prompt template, or the number of repair attempts required per lemma. This mechanism is load-bearing for the central claim of fully automatic project generation (see abstract and the section on agent–Rocq interaction).

    Authors: We agree that the manuscript does not currently provide these implementation-level details on proof repair, which are necessary for full reproducibility of the autonomous process. In the revised version we will add a dedicated subsection to the 'Agent–Rocq interaction' section that (1) describes the serialization of Rocq proof states (goals, hypotheses, and error messages) into LLM-readable text, (2) presents the structure of the repair prompt template, and (3) reports empirical statistics on repair attempts per lemma, including average number of attempts and first-try success rate across the 47 instructions. These additions directly address the load-bearing mechanism without altering the reported outcomes. revision: yes

  2. Referee: [Integration and extraction] The claim that separating effectful C++ scaffolding from extracted pure total functions suffices for a working stateful interpreter is asserted but not supported by any explicit description of how state is maintained across instructions or how the integration was validated beyond the reported tests and fuzzing. This is central to the feasibility conclusion.

    Authors: We acknowledge that the manuscript lacks an explicit description of the state-maintenance mechanism and additional validation steps. The effectful C++ scaffolding maintains mutable CPU state (registers and memory) that is passed to and updated by each extracted pure function. In the revision we will insert a new 'Integration and Extraction' subsection containing (1) a description and diagram of the state-passing interface, (2) a representative code example showing how an extracted function is invoked from the C++ scaffolding, and (3) details of supplementary validation (manual review of generated C++ for all 47 instructions and targeted state-transition checks). This will strengthen the feasibility argument while preserving the existing test and fuzzing results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical case study with independent outcomes

full rationale

The paper reports an empirical case study in which an LLM agent autonomously generates, verifies, extracts, and integrates a full RISC-V RV32I interpreter using Rocq. The headline results (30-minute completion, 1,859 lines of verified Rocq, 2,848 lines of extracted C++, passage of 265 tests, zero crashes/hangs in 12-hour fuzzing) are presented as observed outcomes of running the described process. No mathematical derivation chain, equations, fitted parameters renamed as predictions, or self-referential definitions appear. The separation of effectful scaffolding from pure total functions is stated as a methodological premise, not derived from the results themselves. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are quoted or required to support the central claim. The paper is therefore self-contained against external benchmarks (test suite and fuzzing) and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim depends on the domain assumption that ITPs can handle pure total functions while effects are handled separately in the target language, and that Rocq's proof state feedback enables autonomous repair for this project.

axioms (1)
  • domain assumption ITPs handle pure total functions but not effects such as I/O
    Explicitly stated as the reason for the separation strategy in the abstract.

pith-pipeline@v0.9.1-grok · 5845 in / 1350 out tokens · 35732 ms · 2026-06-29T20:15:21.938889+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    Deepseek-coder: When the large language model meets programming – the rise of code intelligence,

    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y . Wu, Y . K. Li, F. Luo, Y . Xiong, and W. Liang, “Deepseek-coder: When the large language model meets programming – the rise of code intelligence,” 2024. [Online]. Available: https://arxiv.org/abs/2401.14196

  2. [2]

    Evaluating large language models trained on code,

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

  3. [3]

    (2026) Claude code documentation: Overview

    Anthropic. (2026) Claude code documentation: Overview. Anthropic. [Online]. Available: https://code.claude.com/docs/en/overview

  4. [4]

    (2026) Cursor agents

    Anysphere. (2026) Cursor agents. Anysphere

  5. [5]

    The fm agent,

    A. Li, C. Wu, Z. Ge, Y . H. Chong, Z. Hou, L. Cao, C. Ju, J. Wu, H. Li, H. Zhang, S. Feng, M. Zhao, F. Qiu, R. Yang, M. Zhang, W. Zhu, Y . Sun, Q. Sun, S. Yan, D. Liu, D. Yin, and D. Shen, “The fm agent,” 2026. [Online]. Available: https://arxiv.org/abs/2510.26144

  6. [6]

    ACM Trans

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu, “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,”ACM Trans. Inf. Syst., vol. 43, no. 2, pp. 42:1–42:55, 2025. [Online]. Available: https://doi.org/10.1145/3703155

  7. [7]

    Type-constrained code generation with language models,

    N. M ¨undler, J. He, H. Wang, K. Sen, D. Song, and M. T. Vechev, “Type-constrained code generation with language models,”Proc. ACM Program. Lang., vol. 9, no. PLDI, pp. 601–626, 2025. [Online]. Available: https://doi.org/10.1145/3729274

  8. [8]

    Evaluating llm-generated acsl annotations for formal verification,

    A. Beg, D. O’Donoghue, and R. Monahan, “Evaluating llm-generated acsl annotations for formal verification,” 2026. [Online]. Available: https://arxiv.org/abs/2602.13851

  9. [9]

    Dafny: An automatic program verifier for functional correctness,

    K. R. M. Leino, “Dafny: An automatic program verifier for functional correctness,” inLogic for Programming, Artificial Intelligence, and Reasoning - 16th International Conference, LPAR-16, Dakar, Senegal, April 25-May 1, 2010, Revised Selected Papers, ser. Lecture Notes in Computer Science, E. M. Clarke and A. V oronkov, Eds. Springer, 2010, pp. 348–370. ...

  10. [10]

    The Coq reference manual – release 9.0,

    The Coq Development Team, “The Coq reference manual – release 9.0,” 2025. [Online]. Available: https://rocq-prover.org/doc/v9.0/refman/ index.html

  11. [11]

    The lean 4 theorem prover and programming language,

    L. de Moura and S. Ullrich, “The lean 4 theorem prover and programming language,” inAutomated Deduction - CADE 28 - 28th International Conference on Automated Deduction, Virtual Event, July 12-15, 2021, Proceedings, ser. Lecture Notes in Computer Science, A. Platzer and G. Sutcliffe, Eds., vol. 12699. Springer, 2021, pp. 625–635. [Online]. Available: http...

  12. [12]

    The isabelle framework,

    M. Wenzel, L. C. Paulson, and T. Nipkow, “The isabelle framework,” inTheorem Proving in Higher Order Logics, 21st International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings, ser. Lecture Notes in Computer Science, O. A. Mohamed, C. A. Mu ˜noz, and S. Tahar, Eds., vol. 5170. Springer, 2008, pp. 33–38. [Online]. Available: http...

  13. [13]

    Machine-generated, machine-checked proofs for a verified compiler (experience report),

    Z. Paraskevopoulou, “Machine-generated, machine-checked proofs for a verified compiler (experience report),” 2026. [Online]. Available: https://arxiv.org/abs/2602.20082

  14. [14]

    Introduction to the calculus of inductive construc- tions,

    C. Paulin-Mohring, “Introduction to the calculus of inductive construc- tions,”All about Proofs, Proofs for All, vol. 55, 2015

  15. [15]

    Interaction trees: representing recursive and impure programs in coq,

    L.-y. Xia, Y . Zakowski, P. He, C.-K. Hur, G. Malecha, B. C. Pierce, and S. Zdancewic, “Interaction trees: representing recursive and impure programs in coq,”Proc. ACM Program. Lang., vol. 4, no. POPL, Dec

  16. [16]

    Available: https://doi.org/10.1145/3371119

    [Online]. Available: https://doi.org/10.1145/3371119

  17. [17]

    Compcert – a formally verified optimizing compiler,

    X. Leroy, S. Blazy, D. K ¨astner, B. Schommer, M. Pister, and C. Ferdinand, “Compcert – a formally verified optimizing compiler,” in ERTS 2016: Embedded Real Time Software and Systems. SEE, 2016. [Online]. Available: http://xavierleroy.org/publi/erts2016 compcert.pdf

  18. [18]

    Certicoq: A verified compiler for coq,

    A. Anand, A. Appel, G. Morrisett, Z. Paraskevopoulou, R. Pollack, O. S. Belanger, M. Sozeau, and M. Weaver, “Certicoq: A verified compiler for coq,” inThe third international workshop on Coq for programming languages (CoqPL), 2017

  19. [19]

    Korkut and M

    J. Korkut and M. Weaver. (2026) crane. [Online]. Available: https://github.com/bloomberg/crane#maintainers

  20. [20]

    Qemu: a multihost, multitarget emulator,

    D. Bartholomew, “Qemu: a multihost, multitarget emulator,”Linux Journal, vol. 2006, no. 145, p. 3, 2006

  21. [21]

    Unicorn: Next generation cpu emulator framework,

    N. A. Quynh and D. H. Vu, “Unicorn: Next generation cpu emulator framework,”BlackHat USA, vol. 476, 2015

  22. [22]

    (2026) Unprivileged isa: Rv64

    RISC-V International. (2026) Unprivileged isa: Rv64. RISC-V International. RISC-V ISA documentation (online reference). [Online]. Available: https://docs.riscv.org/reference/isa/unpriv/rv64.html

  23. [23]

    Verified extraction from coq to ocaml,

    Y . Forster, M. Sozeau, and N. Tabareau, “Verified extraction from coq to ocaml,” vol. 8, no. PLDI, Jun. 2024. [Online]. Available: https://doi.org/10.1145/3656379

  24. [24]

    coq-of-ocaml,

    formal-land, “coq-of-ocaml,” 2026. [Online]. Available: https://github. com/formal-land/coq-of-ocaml

  25. [25]

    rocq-of-rust,

    ——, “rocq-of-rust,” 2026. [Online]. Available: https://github.com/ formal-land/rocq-of-rust

  26. [26]

    Verus: Verifying rust programs using linear ghost types,

    A. Lattuada, T. Hance, C. Cho, M. Brun, I. Subasinghe, Y . Zhou, J. Howell, B. Parno, and C. Hawblitzel, “Verus: Verifying rust programs using linear ghost types,”Proceedings of the ACM on Programming Languages, vol. 7, no. OOPSLA1, pp. 286–315, 2023

  27. [27]

    AFL++: Combining incremental steps of fuzzing research,

    A. Fioraldi, D. Maier, H. Eißfeldt, and M. Heuse, “AFL++: Combining incremental steps of fuzzing research,” in14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association, Aug. 2020

  28. [28]

    Gpt-4 technical report,

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  29. [29]

    Introducing claude sonnet 4.6,

    Anthropic, “Introducing claude sonnet 4.6,” https://www.anthropic.com/ news/claude-sonnet-4-6, 2026

  30. [30]

    Deepseek-v3.2: Pushing the frontier of open large lan- guage models,

    DeepSeek-AI, “Deepseek-v3.2: Pushing the frontier of open large lan- guage models,” 2025

  31. [31]

    SWE-bench: Can language models resolve real- world github issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real- world github issues?” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https: //openreview.net/forum?id=VTF8yNQM66

  32. [32]

    Composer 2 technical report,

    C. Research, :, A. Chan, A. Shalaby, A. Wettig, A. Sanger, A. Zhai, A. Ajay, A. Nair, C. Snell, C. Lu, C. Shen, E. Jia, F. Cassano, H. Liu, H. Chen, H. Wildermuth, J. Jackson, J. Li, J. Katz, J. Yao, J. Hejna, J. Warner, J. Vering, K. Frans, L. Danilek, L. Wright, L. Cen, L. Melas-Kyriazi, M. Truell, M. de Jong, N. Jain, N. Schmidt, N. Wang, N. Muennighof...

  33. [33]

    Claude code: An agentic coding tool,

    Anthropic, “Claude code: An agentic coding tool,” https://code.claude. com/, 2025

  34. [34]

    (2025) Opencode: Open source, terminal-based AI coding assistant

    OpenCode Team. (2025) Opencode: Open source, terminal-based AI coding assistant. Anomaly. [Online]. Available: https://opencode.ai

  35. [35]

    Bertot and P

    Y . Bertot and P. Cast ´eran,Interactive theorem proving and program development: Coq’Art: the calculus of inductive constructions. Springer Science & Business Media, 2013

  36. [36]

    seL4: formal verification of an os kernel

    G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. A. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood, “seL4: formal verification of an os kernel.” inProceedings of the 22nd ACM Symposium on Operating Systems Principles 2009, SOSP 2009, Big Sky, Montana, USA, October 11-14, 2009, J. N. Matthews a...

  37. [37]

    CertiKOS: An extensible architecture for building certified concurrent os kernels,

    R. Gu, Z. Shao, H. Chen, X. N. Wu, J. Kim, V . Sj ¨oberg, and D. Costanzo, “CertiKOS: An extensible architecture for building certified concurrent os kernels,” in12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, K. Keeton and T. Roscoe, Eds. USENIX Association, 2016, pp. 653–669. [Onlin...

  38. [38]

    Hammer for coq: Automation for dependent type theory,

    L. Czajka and C. Kaliszyk, “Hammer for coq: Automation for dependent type theory,”J. Autom. Reason., vol. 61, no. 1-4, pp. 423–453, 2018. [Online]. Available: https://doi.org/10.1007/s10817-018-9458-4 JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10

  39. [39]

    Sledgehammer: Judgement day,

    S. B ¨ohme and T. Nipkow, “Sledgehammer: Judgement day,” in Automated Reasoning, 5th International Joint Conference, IJCAR 2010, Edinburgh, UK, July 16-19, 2010. Proceedings, ser. Lecture Notes in Computer Science, J. Giesl and R. H ¨ahnle, Eds., vol. 6173. Springer, 2010, pp. 107–121. [Online]. Available: https://doi.org/10.1007/978-3-642-14203-1\ 9

  40. [40]

    Lean-auto: An interface between lean 4 and automated theorem provers,

    Y . Qian, J. Clune, C. Barrett, and J. Avigad, “Lean-auto: An interface between lean 4 and automated theorem provers,” 2025. [Online]. Available: https://arxiv.org/abs/2505.14929

  41. [41]

    Learning to prove theorems via interacting with proof assistants,

    K. Yang and J. Deng, “Learning to prove theorems via interacting with proof assistants,” inProceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 6984–6994. [Online]. Availa...

  42. [42]

    Generating correctness proofs with neural networks,

    A. Sanchez-Stern, Y . Alhessi, L. K. Saul, and S. Lerner, “Generating correctness proofs with neural networks,” inProceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL@PLDI 2020, London, UK, June 15, 2020, K. Sen and M. Naik, Eds. ACM, 2020, pp. 1–10. [Online]. Available: https://doi.org/10.1145/3394...

  43. [43]

    Graph2tac: Online representation learning of formal math concepts,

    L. Blaauwbroek, M. Ols ´ak, J. Rute, F. I. S. Massolo, J. Piepenbrock, and V . Pestun, “Graph2tac: Online representation learning of formal math concepts,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=A7CtiozznN

  44. [44]

    Leandojo: Theorem proving with retrieval-augmented language models,

    K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and A. Anandkumar, “Leandojo: Theorem proving with retrieval-augmented language models,” inAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. ...

  45. [45]

    Lego-prover: Neural theorem proving with growing libraries,

    H. Wang, H. Xin, C. Zheng, Z. Liu, Q. Cao, Y . Huang, J. Xiong, H. Shi, E. Xie, J. Yin, Z. Li, and X. Liang, “Lego-prover: Neural theorem proving with growing libraries,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id...

  46. [46]

    Proof automation with large language models,

    M. Lu, B. Delaware, and T. Zhang, “Proof automation with large language models,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2024, V . Filkov, B. Ray, and M. Zhou, Eds. ACM, 2024, pp. 1509–1520. [Online]. Available: https://doi.org/10.1145/3691620.3695521

  47. [47]

    Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification

    S. R. Kasibatla, A. Agarwal, Y . Brun, S. Lerner, T. Ringer, and E. First, “Cobblestone: Iterative automation for formal verification,” CoRR, vol. abs/2410.19940, 2024. [Online]. Available: https://doi.org/ 10.48550/arXiv.2410.19940

  48. [48]

    Rango: Adaptive retrieval-augmented proving for automated software verification,

    K. Thompson, N. Saavedra, P. Carrott, K. Fisher, A. Sanchez-Stern, Y . Brun, J. F. Ferreira, S. Lerner, and E. First, “Rango: Adaptive retrieval-augmented proving for automated software verification,” in 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 2025, pp. 347–359. [Online]....