Large Lemma Miners: Can LLMs do Induction Proofs for Hardware?
Pith reviewed 2026-05-18 01:31 UTC · model grok-4.3
The pith
LLMs can produce provably correct inductive invariants for mid-size open-source RTL designs when given enough reprompting attempts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that LLMs, guided by two different prompting frameworks and paired with formal checking, can generate inductive arguments for hardware designs. With sufficient reprompting, the method produces a provably correct inductive argument for 84 percent of the mid-size open-source RTL designs in their test set.
What carries the argument
Neurosymbolic prompting frameworks that propose candidate invariants for later verification by a formal symbolic tool.
If this is right
- Formal verification engineers could offload part of the invariant-generation task to an LLM-assisted workflow.
- The time required to verify mid-size hardware modules would decrease once candidate invariants are produced automatically.
- The same combination of prompting and symbolic checking could be applied to additional open-source or similar-scale designs.
- Industrial hardware teams might integrate the method into existing verification pipelines to reduce manual effort.
Where Pith is reading between the lines
- Testing the same prompting setups on designs that exceed the current size range would reveal whether the success rate holds as complexity grows.
- Adapting the approach to software verification or other domains that rely on inductive reasoning could extend its reach beyond hardware.
- Combining the LLM proposals with additional automated proof strategies might further increase the fraction of problems solved without human input.
Load-bearing premise
The tested mid-size open-source RTL designs represent the kinds of industrial verification tasks that still require manual invariant generation today.
What would settle it
A single larger or proprietary design on which repeated prompting and checking never yields a correct inductive argument would falsify the reported success rate.
Figures
read the original abstract
Large Language Models (LLMs) have shown potential for solving mathematical tasks. We show that LLMs can be utilized to generate proofs by induction for hardware verification and thereby replace some of the manual work done by Formal Verification engineers and deliver industrial value. We present a neurosymbolic approach that includes two prompting frameworks to generate candidate invariants, which are checked using a formal, symbolic tool. Our results indicate that with sufficient reprompting, LLMs are able to generate inductive arguments for mid-size open-source RTL designs. For 84% of our problem set, at least one of the prompt setups succeeded in producing a provably correct inductive argument.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a neurosymbolic approach in which LLMs are used via two prompting frameworks to generate candidate inductive invariants for RTL hardware designs; these candidates are then validated by an external formal symbolic checker. The central empirical claim is that, with sufficient reprompting, the method produces provably correct inductive arguments for 84% of a problem set consisting of mid-size open-source RTL designs.
Significance. If the evaluation is sound, the work provides concrete evidence that LLMs can generate machine-checkable inductive lemmas for hardware verification tasks, offering a potential reduction in manual invariant engineering for mid-sized designs. The use of an external formal checker supplies an independent soundness guarantee that strengthens the experimental results. The industrial-value claim, however, rests on an untested assumption that the chosen open-source designs are representative of proprietary verification workloads.
major comments (2)
- [Evaluation] Evaluation section: the 84% success rate is reported for the problem set, but the text does not clarify whether this figure counts designs for which at least one of multiple independent prompting attempts succeeded or whether any post-hoc filtering of designs or prompts occurred. This detail is load-bearing for the reliability claim.
- [Results / Evaluation] §4 (or equivalent results section): no quantitative comparison is supplied between the evaluated mid-size open-source RTL designs and industrial benchmarks (e.g., register count, cone-of-influence size, or state-space cardinality). Without such data the extrapolation to “industrial value” and replacement of manual invariant work remains unsupported.
minor comments (2)
- [Method] The two prompting frameworks are described at a high level; a concrete example of a generated invariant together with the corresponding RTL fragment would improve reproducibility.
- [Related Work] The paper should cite prior work on LLM-assisted lemma generation in software verification (e.g., Dafny or Coq contexts) to better situate the hardware-specific contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments below and have revised the manuscript to improve clarity and support for our claims where possible.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: the 84% success rate is reported for the problem set, but the text does not clarify whether this figure counts designs for which at least one of multiple independent prompting attempts succeeded or whether any post-hoc filtering of designs or prompts occurred. This detail is load-bearing for the reliability claim.
Authors: The manuscript abstract already states that 'For 84% of our problem set, at least one of the prompt setups succeeded in producing a provably correct inductive argument.' We agree this should be explicit in the Evaluation section as well. In the revised manuscript we have added text clarifying that the reported 84% is the fraction of designs for which at least one of the two prompting frameworks, after reprompting, produced a valid inductive argument. No post-hoc filtering of designs or prompts was performed; the entire problem set was evaluated uniformly. revision: yes
-
Referee: [Results / Evaluation] §4 (or equivalent results section): no quantitative comparison is supplied between the evaluated mid-size open-source RTL designs and industrial benchmarks (e.g., register count, cone-of-influence size, or state-space cardinality). Without such data the extrapolation to “industrial value” and replacement of manual invariant work remains unsupported.
Authors: We agree that quantitative metrics would strengthen the discussion of industrial relevance. The revised manuscript now includes additional data on the evaluated designs, such as register counts, module sizes, and estimated state-space cardinalities. Direct numerical comparison to proprietary industrial benchmarks is not possible, but we have added discussion noting that the chosen open-source RTL modules are representative of mid-sized verification problems in terms of complexity. revision: partial
Circularity Check
No circularity: empirical evaluation of LLM-generated invariants verified by external formal tools
full rationale
The paper reports an experimental neurosymbolic method in which LLMs propose candidate inductive invariants for RTL designs; these candidates are then checked for correctness by an independent symbolic formal verifier. The reported 84% success rate is obtained by direct measurement on a fixed problem set of open-source designs, with no fitted parameters, self-referential equations, or load-bearing self-citations that reduce the central claim to its own inputs. The derivation chain consists of prompting, candidate generation, and external verification; none of these steps collapses by construction into a renaming or re-use of the measured outcome itself. The evaluation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Formal symbolic tools can correctly decide inductiveness of candidate invariants for RTL designs
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We present a neurosymbolic approach that includes two prompting frameworks to generate candidate invariants, which are checked using a formal, symbolic tool.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our results indicate that with sufficient reprompting, LLMs are able to generate inductive arguments for mid-size open-source RTL designs.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
t\"{a}k\={o}Formal: Enabling Robust Software for Programmable Memory Hierarchies (Extended Version)
An ISA-level memory consistency model for tākō is introduced and proven sound by verifying that executions of an implementation model are permitted by the model.
Reference graph
Works this paper leans on
- [1]
-
[2]
Advancing mathematics by guiding human intuition with AI,
A. Davies, P. Velickovic, L. Buesing, S. Blackwell, D. Zheng, N. Tomasev, R. Tanburn, P. W. Battaglia, C. Blundell, A. Juh ´asz, M. Lackenby, G. Williamson, D. Hassabis, and P. Kohli, “Advancing mathematics by guiding human intuition with AI,”Nat., vol. 600, no. 7887, pp. 70–74, 2021
work page 2021
-
[3]
Discovering faster matrix multipli- cation algorithms with reinforcement learning,
A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz, D. Silver, D. Hassabis, and P. Kohli, “Discovering faster matrix multipli- cation algorithms with reinforcement learning,”Nat., vol. 610, no. 7930, pp. 47–53, 2022
work page 2022
-
[4]
AlphaEvolve: A coding agent for scientific and algorithmic discovery
A. Novikov, N. Vu, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog, “AlphaEvolve: A coding agent for scientific and algorithmic discovery,”CoRR, vol. abs/2506.13131, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Global Lyapunov functions: A long-standing open problem in mathematics, with symbolic transformers,
A. Alfarano, F. Charton, and A. Hayat, “Global Lyapunov functions: A long-standing open problem in mathematics, with symbolic transformers,” inNeurIPS, 2024
work page 2024
-
[6]
Machine learning invariants of arithmetic curves,
Y . He, K. Lee, and T. Oliver, “Machine learning invariants of arithmetic curves,”J. Symb. Comput., vol. 115, pp. 478–491, 2023
work page 2023
-
[7]
Predicting root numbers with neural networks,
A. Pozdnyakov, “Predicting root numbers with neural networks,”Int. J. Data Sci. Math. Sci., vol. 2, no. 1, pp. 15–37, 2024
work page 2024
-
[8]
Learning Euler factors of elliptic curves,
A. Babei, F. Charton, E. Costa, X. Huang, K. Lee, D. Lowry-Duda, A. Narayanan, and A. Pozdnyakov, “Learning Euler factors of elliptic curves,”CoRR, vol. abs/2502.10357, 2025
-
[9]
Mathematical discoveries from program search with large language models,
B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. R. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, P. Kohli, and A. Fawzi, “Mathematical discoveries from program search with large language models,”Nat., vol. 625, no. 7995, pp. 468–475, 2024
work page 2024
-
[10]
Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,
A. Q. Jiang, S. Welleck, J. P. Zhou, T. Lacroix, J. Liu, W. Li, M. Jamnik, G. Lample, and Y . Wu, “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,” inICLR. OpenReview.net, 2023
work page 2023
-
[11]
Lean-STaR: Learning to interleave thinking and proving,
H. Lin, Z. Sun, S. Welleck, and Y . Yang, “Lean-STaR: Learning to interleave thinking and proving,” inICLR. OpenReview.net, 2025
work page 2025
-
[12]
arXiv preprint arXiv:2410.07432 , year =
L. Pan, V . Ganesh, J. D. Abernethy, C. Esposo, and W. Lee, “Can transformers reason logically? A study in SAT solving,”CoRR, vol. abs/2410.07432, 2024
-
[13]
Purvis, B., Mao, Y., and Robinson, D
I. Petrov, J. Dekoninck, L. Baltadzhiev, M. Drencheva, K. Minchev, M. Balunovic, N. Jovanovic, and M. T. Vechev, “Proof or bluff? Evaluating LLMs on 2025 USA Math Olympiad,”CoRR, vol. abs/2503.21934, 2025
-
[14]
Formal mathematical reasoning: A new frontier in ai.arXiv preprint 2412.16075, 2024
K. Yang, G. Poesia, J. He, W. Li, K. Lauter, S. Chaudhuri, and D. Song, “Formal mathematical reasoning: A new frontier in AI,”CoRR, vol. abs/2412.16075, 2024
-
[15]
Towards reasoning in large language models: A survey,
J. Huang and K. C. Chang, “Towards reasoning in large language models: A survey,” inACL. Association for Computational Linguistics, 2023, pp. 1049–1065
work page 2023
-
[16]
Transformers as soft reasoners over language,
P. Clark, O. Tafjord, and K. Richardson, “Transformers as soft reasoners over language,” inIJCAI. ijcai.org, 2020, pp. 3882–3890
work page 2020
-
[17]
ProofWriter: Generating implications, proofs, and abductive statements over natural language,
O. Tafjord, B. Dalvi, and P. Clark, “ProofWriter: Generating implications, proofs, and abductive statements over natural language,” inACL/IJCNLP. Association for Computational Linguistics, 2021, pp. 3621–3634
work page 2021
-
[18]
Learning to reason via mixture-of-thought for logical reasoning,
T. Zheng, L. Chen, S. Han, R. T. McCoy, and H. Huang, “Learning to reason via mixture-of-thought for logical reasoning,”CoRR, vol. abs/2505.15817, 2025
-
[19]
Can ChatGPT support software verification?
C. Janßen, C. Richter, and H. Wehrheim, “Can ChatGPT support software verification?” inFASE, ser. LNCS, vol. 14573. Springer, 2024, pp. 266–279
work page 2024
-
[20]
Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma
A. Kamath, A. Senthilnathan, S. Chakraborty, P. Deligiannis, S. K. Lahiri, A. Lal, A. Rastogi, S. Roy, and R. Sharma, “Finding inductive loop invariants using large language models,”CoRR, vol. abs/2311.07948, 2023
-
[21]
Lemur: Integrating large language models in automated program verification,
H. Wu, C. W. Barrett, and N. Narodytska, “Lemur: Integrating large language models in automated program verification,” inICLR. OpenRe- view.net, 2024
work page 2024
-
[22]
Leveraging large language models for automated proof synthesis in Rust,
J. Yao, Z. Zhou, W. Chen, and W. Cui, “Leveraging large language models for automated proof synthesis in Rust,”CoRR, vol. abs/2311.03739, 2023
-
[23]
ClassInvGen: Class Invariant Synthesis using Large Language Models
C. Sun, V . Agashe, S. Chakraborty, J. Taneja, C. W. Barrett, D. L. Dill, X. Qiu, and S. K. Lahiri, “ClassInvGen: Class invariant synthesis using large language models,”CoRR, vol. abs/2502.18917, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2502.18917
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.18917 2025
-
[24]
Scalable, validated code translation of entire projects using large language models,
H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, validated code translation of entire projects using large language models,”PLDI, pp. 1616–1641, 2025. [Online]. Available: https://doi.org/10.1145/3729315
-
[25]
VERT: verified equivalent rust transpilation with few- shot learning,
A. Z. H. Yang, Y . Takashima, B. Paulsen, J. Dodds, and D. Kroening, “VERT: verified equivalent Rust transpilation with few-shot learning,” CoRR, vol. abs/2404.18852, 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2404.18852
-
[26]
M. Giacobbe, D. Kroening, A. Pal, and M. Tautschnig, “Neural model checking,” inNeurIPS, 2024
work page 2024
-
[27]
Using LLMs to facilitate formal verification of RTL,
M. Orenes-Vera, M. Martonosi, and D. Wentzlaff, “Using LLMs to facilitate formal verification of RTL,”CoRR, vol. abs/2309.09437, 2023
-
[28]
AssertLLM: Generating hardware verification assertions from design specifications via multi-LLMs,
Z. Yan, W. Fang, M. Li, M. Li, S. Liu, Z. Xie, and H. Zhang, “AssertLLM: Generating hardware verification assertions from design specifications via multi-LLMs,” inASPDAC. ACM, 2025, pp. 614–621
work page 2025
-
[29]
(Security) assertions by large language models,
R. Kande, H. Pearce, B. Tan, B. Dolan-Gavitt, S. Thakur, R. Karri, and J. Rajendran, “(Security) assertions by large language models,”IEEE Trans. Inf. Forensics Secur., vol. 19, pp. 4374–4389, 2024
work page 2024
-
[30]
VeriGen: A large language model for Verilog code generation,
S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “VeriGen: A large language model for Verilog code generation,” ACM Trans. Design Autom. Electr. Syst., vol. 29, no. 3, pp. 46:1–46:31, 2024
work page 2024
-
[31]
v2c – A Verilog to C translator,
R. Mukherjee, M. Tautschnig, and D. Kroening, “v2c – A Verilog to C translator,” inTACAS, ser. LNCS, vol. 9636. Springer, 2016, pp. 580–586
work page 2016
-
[32]
E. M. Clarke, O. Grumberg, D. Kroening, D. A. Peled, and H. Veith, Model checking, 2nd Edition. MIT Press, 2018
work page 2018
-
[33]
Boolean satisfiability solvers and their applications in model checking,
Y . Vizel, G. Weissenbacher, and S. Malik, “Boolean satisfiability solvers and their applications in model checking,”Proc. IEEE, vol. 103, no. 11, pp. 2021–2035, 2015
work page 2021
-
[34]
A. Ivrii and Y . Vizel, “Bit-level model checking,” inHandbook of Computer Architecture, A. Chattopadhyay, Ed. Springer, 2022, pp. 1–40
work page 2022
-
[35]
Detecting hallucinations in large language models using semantic entropy,
S. Farquhar, J. Kossen, L. Kuhn, and Y . Gal, “Detecting hallucinations in large language models using semantic entropy,”Nat., vol. 630, no. 8017, pp. 625–630, 2024
work page 2024
-
[36]
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
J. Kossen, J. Han, M. Razzak, L. Schut, S. A. Malik, and Y . Gal, “Semantic entropy probes: Robust and cheap hallucination detection in LLMs,”CoRR, vol. abs/2406.15927, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inNeurIPS, 2022. [Online]. Available: http://papers.nips.cc/paper files/paper/2022/hash/ 9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
work page 2022
-
[38]
Sentence-BERT: Sentence embeddings using Siamese BERT-networks,
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” inEMNLP-IJCNLP. Association for Computational Linguistics, 2019, pp. 3980–3990
work page 2019
-
[39]
Annotation inference for modular checkers,
C. Flanagan, R. Joshi, and K. R. M. Leino, “Annotation inference for modular checkers,”Inf. Process. Lett., vol. 77, no. 2-4, pp. 97–108, 2001
work page 2001
-
[40]
Small inductive safe invariants,
A. Ivrii, A. Gurfinkel, and A. Belov, “Small inductive safe invariants,” in 2014 Formal Methods in Computer-Aided Design (FMCAD), 2014, pp. 115–122
work page 2014
-
[41]
Hardware verification using software analyzers,
R. Mukherjee, D. Kroening, and T. Melham, “Hardware verification using software analyzers,” inVLSI. IEEE, 2015, pp. 7–12
work page 2015
-
[42]
Checking safety properties using induction and a SAT-solver,
M. Sheeran, S. Singh, and G. St ˚almarck, “Checking safety properties using induction and a SAT-solver,” inFMCAD, ser. LNCS, vol. 1954. Springer, 2000, pp. 108–125
work page 1954
-
[43]
A. Biere, A. Cimatti, E. M. Clarke, O. Strichman, and Y . Zhu, “Bounded model checking,”Adv. Comput., vol. 58, pp. 117–148, 2003
work page 2003
-
[44]
VCEGAR: Verilog counterexample guided abstraction refinement,
H. Jain, D. Kroening, N. Sharygina, and E. M. Clarke, “VCEGAR: Verilog counterexample guided abstraction refinement,” inTACAS, ser. LNCS, vol
-
[45]
Springer, 2007, pp. 583–586
work page 2007
-
[46]
F. Somenzi, “VIS Verification Benchmarks,” ftp://vlsi.colorado.edu/pub/ vis/vis-verilog-models-1.3.tar.gz
-
[47]
Circuitcove.com, https://circuitcove.com/
-
[48]
Loop invariant generation with LLMs
Microsoft. Loop invariant generation with LLMs. [Online]. Available: https://github.com/microsoft/loop-invariant-gen-experiments
-
[49]
Hardware model checking competition 2024,
A. Biere, N. Froleyks, and M. Preiner, “Hardware model checking competition 2024,” inFormal Methods in Computer-Aided Design, FMCAD. IEEE, 2024, p. 1
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.