Large Lemma Miners: Can LLMs do Induction Proofs for Hardware?

Daniel Kroening; Michael Tautschnig; Romy Peled; Yakir Vizel

arxiv: 2511.02521 · v2 · submitted 2025-11-04 · 💻 cs.LO

Large Lemma Miners: Can LLMs do Induction Proofs for Hardware?

Romy Peled , Daniel Kroening , Michael Tautschnig , Yakir Vizel This is my paper

Pith reviewed 2026-05-18 01:31 UTC · model grok-4.3

classification 💻 cs.LO

keywords large language modelshardware verificationinductive invariantsformal verificationRTL designsinvariant generationneurosymbolic methods

0 comments

The pith

LLMs can produce provably correct inductive invariants for mid-size open-source RTL designs when given enough reprompting attempts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can automate the generation of inductive arguments required for formal hardware verification. It introduces a neurosymbolic method that prompts the models to propose candidate invariants and then uses symbolic tools to confirm which ones hold. On a collection of mid-size open-source designs, at least one prompting setup succeeded for 84 percent of the cases. A reader would care because current verification practice still depends on engineers manually crafting these invariants, a step that consumes significant time and expertise.

Core claim

The authors show that LLMs, guided by two different prompting frameworks and paired with formal checking, can generate inductive arguments for hardware designs. With sufficient reprompting, the method produces a provably correct inductive argument for 84 percent of the mid-size open-source RTL designs in their test set.

What carries the argument

Neurosymbolic prompting frameworks that propose candidate invariants for later verification by a formal symbolic tool.

If this is right

Formal verification engineers could offload part of the invariant-generation task to an LLM-assisted workflow.
The time required to verify mid-size hardware modules would decrease once candidate invariants are produced automatically.
The same combination of prompting and symbolic checking could be applied to additional open-source or similar-scale designs.
Industrial hardware teams might integrate the method into existing verification pipelines to reduce manual effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Testing the same prompting setups on designs that exceed the current size range would reveal whether the success rate holds as complexity grows.
Adapting the approach to software verification or other domains that rely on inductive reasoning could extend its reach beyond hardware.
Combining the LLM proposals with additional automated proof strategies might further increase the fraction of problems solved without human input.

Load-bearing premise

The tested mid-size open-source RTL designs represent the kinds of industrial verification tasks that still require manual invariant generation today.

What would settle it

A single larger or proprietary design on which repeated prompting and checking never yields a correct inductive argument would falsify the reported success rate.

Figures

Figures reproduced from arXiv: 2511.02521 by Daniel Kroening, Michael Tautschnig, Romy Peled, Yakir Vizel.

read the original abstract

Large Language Models (LLMs) have shown potential for solving mathematical tasks. We show that LLMs can be utilized to generate proofs by induction for hardware verification and thereby replace some of the manual work done by Formal Verification engineers and deliver industrial value. We present a neurosymbolic approach that includes two prompting frameworks to generate candidate invariants, which are checked using a formal, symbolic tool. Our results indicate that with sufficient reprompting, LLMs are able to generate inductive arguments for mid-size open-source RTL designs. For 84% of our problem set, at least one of the prompt setups succeeded in producing a provably correct inductive argument.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a neurosymbolic approach in which LLMs are used via two prompting frameworks to generate candidate inductive invariants for RTL hardware designs; these candidates are then validated by an external formal symbolic checker. The central empirical claim is that, with sufficient reprompting, the method produces provably correct inductive arguments for 84% of a problem set consisting of mid-size open-source RTL designs.

Significance. If the evaluation is sound, the work provides concrete evidence that LLMs can generate machine-checkable inductive lemmas for hardware verification tasks, offering a potential reduction in manual invariant engineering for mid-sized designs. The use of an external formal checker supplies an independent soundness guarantee that strengthens the experimental results. The industrial-value claim, however, rests on an untested assumption that the chosen open-source designs are representative of proprietary verification workloads.

major comments (2)

[Evaluation] Evaluation section: the 84% success rate is reported for the problem set, but the text does not clarify whether this figure counts designs for which at least one of multiple independent prompting attempts succeeded or whether any post-hoc filtering of designs or prompts occurred. This detail is load-bearing for the reliability claim.
[Results / Evaluation] §4 (or equivalent results section): no quantitative comparison is supplied between the evaluated mid-size open-source RTL designs and industrial benchmarks (e.g., register count, cone-of-influence size, or state-space cardinality). Without such data the extrapolation to “industrial value” and replacement of manual invariant work remains unsupported.

minor comments (2)

[Method] The two prompting frameworks are described at a high level; a concrete example of a generated invariant together with the corresponding RTL fragment would improve reproducibility.
[Related Work] The paper should cite prior work on LLM-assisted lemma generation in software verification (e.g., Dafny or Coq contexts) to better situate the hardware-specific contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments below and have revised the manuscript to improve clarity and support for our claims where possible.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the 84% success rate is reported for the problem set, but the text does not clarify whether this figure counts designs for which at least one of multiple independent prompting attempts succeeded or whether any post-hoc filtering of designs or prompts occurred. This detail is load-bearing for the reliability claim.

Authors: The manuscript abstract already states that 'For 84% of our problem set, at least one of the prompt setups succeeded in producing a provably correct inductive argument.' We agree this should be explicit in the Evaluation section as well. In the revised manuscript we have added text clarifying that the reported 84% is the fraction of designs for which at least one of the two prompting frameworks, after reprompting, produced a valid inductive argument. No post-hoc filtering of designs or prompts was performed; the entire problem set was evaluated uniformly. revision: yes
Referee: [Results / Evaluation] §4 (or equivalent results section): no quantitative comparison is supplied between the evaluated mid-size open-source RTL designs and industrial benchmarks (e.g., register count, cone-of-influence size, or state-space cardinality). Without such data the extrapolation to “industrial value” and replacement of manual invariant work remains unsupported.

Authors: We agree that quantitative metrics would strengthen the discussion of industrial relevance. The revised manuscript now includes additional data on the evaluated designs, such as register counts, module sizes, and estimated state-space cardinalities. Direct numerical comparison to proprietary industrial benchmarks is not possible, but we have added discussion noting that the chosen open-source RTL modules are representative of mid-sized verification problems in terms of complexity. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical evaluation of LLM-generated invariants verified by external formal tools

full rationale

The paper reports an experimental neurosymbolic method in which LLMs propose candidate inductive invariants for RTL designs; these candidates are then checked for correctness by an independent symbolic formal verifier. The reported 84% success rate is obtained by direct measurement on a fixed problem set of open-source designs, with no fitted parameters, self-referential equations, or load-bearing self-citations that reduce the central claim to its own inputs. The derivation chain consists of prompting, candidate generation, and external verification; none of these steps collapses by construction into a renaming or re-use of the measured outcome itself. The evaluation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that LLM-generated invariants can be mechanically checked for inductiveness by existing symbolic tools and that the chosen problem set reflects real verification workloads. No free parameters or invented entities are introduced.

axioms (1)

domain assumption Formal symbolic tools can correctly decide inductiveness of candidate invariants for RTL designs
Invoked when the paper states that generated invariants are checked using a formal, symbolic tool.

pith-pipeline@v0.9.0 · 5641 in / 1198 out tokens · 36806 ms · 2026-05-18T01:31:20.814089+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present a neurosymbolic approach that includes two prompting frameworks to generate candidate invariants, which are checked using a formal, symbolic tool.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results indicate that with sufficient reprompting, LLMs are able to generate inductive arguments for mid-size open-source RTL designs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

t\"{a}k\={o}Formal: Enabling Robust Software for Programmable Memory Hierarchies (Extended Version)
cs.AR 2026-05 unverdicted novelty 7.0

An ISA-level memory consistency model for tākō is introduced and proven sound by verifying that executions of an implementation model are permitted by the model.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

[Online]

deepmind.google. [Online]. Available: https://tinyurl.com/5n7abr7h

work page
[2]

Advancing mathematics by guiding human intuition with AI,

A. Davies, P. Velickovic, L. Buesing, S. Blackwell, D. Zheng, N. Tomasev, R. Tanburn, P. W. Battaglia, C. Blundell, A. Juh ´asz, M. Lackenby, G. Williamson, D. Hassabis, and P. Kohli, “Advancing mathematics by guiding human intuition with AI,”Nat., vol. 600, no. 7887, pp. 70–74, 2021

work page 2021
[3]

Discovering faster matrix multipli- cation algorithms with reinforcement learning,

A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz, D. Silver, D. Hassabis, and P. Kohli, “Discovering faster matrix multipli- cation algorithms with reinforcement learning,”Nat., vol. 610, no. 7930, pp. 47–53, 2022

work page 2022
[4]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

A. Novikov, N. Vu, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog, “AlphaEvolve: A coding agent for scientific and algorithmic discovery,”CoRR, vol. abs/2506.13131, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Global Lyapunov functions: A long-standing open problem in mathematics, with symbolic transformers,

A. Alfarano, F. Charton, and A. Hayat, “Global Lyapunov functions: A long-standing open problem in mathematics, with symbolic transformers,” inNeurIPS, 2024

work page 2024
[6]

Machine learning invariants of arithmetic curves,

Y . He, K. Lee, and T. Oliver, “Machine learning invariants of arithmetic curves,”J. Symb. Comput., vol. 115, pp. 478–491, 2023

work page 2023
[7]

Predicting root numbers with neural networks,

A. Pozdnyakov, “Predicting root numbers with neural networks,”Int. J. Data Sci. Math. Sci., vol. 2, no. 1, pp. 15–37, 2024

work page 2024
[8]

Learning Euler factors of elliptic curves,

A. Babei, F. Charton, E. Costa, X. Huang, K. Lee, D. Lowry-Duda, A. Narayanan, and A. Pozdnyakov, “Learning Euler factors of elliptic curves,”CoRR, vol. abs/2502.10357, 2025

work page arXiv 2025
[9]

Mathematical discoveries from program search with large language models,

B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. R. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, P. Kohli, and A. Fawzi, “Mathematical discoveries from program search with large language models,”Nat., vol. 625, no. 7995, pp. 468–475, 2024

work page 2024
[10]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,

A. Q. Jiang, S. Welleck, J. P. Zhou, T. Lacroix, J. Liu, W. Li, M. Jamnik, G. Lample, and Y . Wu, “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,” inICLR. OpenReview.net, 2023

work page 2023
[11]

Lean-STaR: Learning to interleave thinking and proving,

H. Lin, Z. Sun, S. Welleck, and Y . Yang, “Lean-STaR: Learning to interleave thinking and proving,” inICLR. OpenReview.net, 2025

work page 2025
[12]

arXiv preprint arXiv:2410.07432 , year =

L. Pan, V . Ganesh, J. D. Abernethy, C. Esposo, and W. Lee, “Can transformers reason logically? A study in SAT solving,”CoRR, vol. abs/2410.07432, 2024

work page arXiv 2024
[13]

Purvis, B., Mao, Y., and Robinson, D

I. Petrov, J. Dekoninck, L. Baltadzhiev, M. Drencheva, K. Minchev, M. Balunovic, N. Jovanovic, and M. T. Vechev, “Proof or bluff? Evaluating LLMs on 2025 USA Math Olympiad,”CoRR, vol. abs/2503.21934, 2025

work page arXiv 2025
[14]

Formal mathematical reasoning: A new frontier in ai.arXiv preprint 2412.16075, 2024

K. Yang, G. Poesia, J. He, W. Li, K. Lauter, S. Chaudhuri, and D. Song, “Formal mathematical reasoning: A new frontier in AI,”CoRR, vol. abs/2412.16075, 2024

work page arXiv 2024
[15]

Towards reasoning in large language models: A survey,

J. Huang and K. C. Chang, “Towards reasoning in large language models: A survey,” inACL. Association for Computational Linguistics, 2023, pp. 1049–1065

work page 2023
[16]

Transformers as soft reasoners over language,

P. Clark, O. Tafjord, and K. Richardson, “Transformers as soft reasoners over language,” inIJCAI. ijcai.org, 2020, pp. 3882–3890

work page 2020
[17]

ProofWriter: Generating implications, proofs, and abductive statements over natural language,

O. Tafjord, B. Dalvi, and P. Clark, “ProofWriter: Generating implications, proofs, and abductive statements over natural language,” inACL/IJCNLP. Association for Computational Linguistics, 2021, pp. 3621–3634

work page 2021
[18]

Learning to reason via mixture-of-thought for logical reasoning,

T. Zheng, L. Chen, S. Han, R. T. McCoy, and H. Huang, “Learning to reason via mixture-of-thought for logical reasoning,”CoRR, vol. abs/2505.15817, 2025

work page arXiv 2025
[19]

Can ChatGPT support software verification?

C. Janßen, C. Richter, and H. Wehrheim, “Can ChatGPT support software verification?” inFASE, ser. LNCS, vol. 14573. Springer, 2024, pp. 266–279

work page 2024
[20]

Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma

A. Kamath, A. Senthilnathan, S. Chakraborty, P. Deligiannis, S. K. Lahiri, A. Lal, A. Rastogi, S. Roy, and R. Sharma, “Finding inductive loop invariants using large language models,”CoRR, vol. abs/2311.07948, 2023

work page arXiv 2023
[21]

Lemur: Integrating large language models in automated program verification,

H. Wu, C. W. Barrett, and N. Narodytska, “Lemur: Integrating large language models in automated program verification,” inICLR. OpenRe- view.net, 2024

work page 2024
[22]

Leveraging large language models for automated proof synthesis in Rust,

J. Yao, Z. Zhou, W. Chen, and W. Cui, “Leveraging large language models for automated proof synthesis in Rust,”CoRR, vol. abs/2311.03739, 2023

work page arXiv 2023
[23]

ClassInvGen: Class Invariant Synthesis using Large Language Models

C. Sun, V . Agashe, S. Chakraborty, J. Taneja, C. W. Barrett, D. L. Dill, X. Qiu, and S. K. Lahiri, “ClassInvGen: Class invariant synthesis using large language models,”CoRR, vol. abs/2502.18917, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2502.18917

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.18917 2025
[24]

Scalable, validated code translation of entire projects using large language models,

H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, validated code translation of entire projects using large language models,”PLDI, pp. 1616–1641, 2025. [Online]. Available: https://doi.org/10.1145/3729315

work page doi:10.1145/3729315 2025
[25]

VERT: verified equivalent rust transpilation with few- shot learning,

A. Z. H. Yang, Y . Takashima, B. Paulsen, J. Dodds, and D. Kroening, “VERT: verified equivalent Rust transpilation with few-shot learning,” CoRR, vol. abs/2404.18852, 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2404.18852

work page arXiv 2024
[26]

Neural model checking,

M. Giacobbe, D. Kroening, A. Pal, and M. Tautschnig, “Neural model checking,” inNeurIPS, 2024

work page 2024
[27]

Using LLMs to facilitate formal verification of RTL,

M. Orenes-Vera, M. Martonosi, and D. Wentzlaff, “Using LLMs to facilitate formal verification of RTL,”CoRR, vol. abs/2309.09437, 2023

work page arXiv 2023
[28]

AssertLLM: Generating hardware verification assertions from design specifications via multi-LLMs,

Z. Yan, W. Fang, M. Li, M. Li, S. Liu, Z. Xie, and H. Zhang, “AssertLLM: Generating hardware verification assertions from design specifications via multi-LLMs,” inASPDAC. ACM, 2025, pp. 614–621

work page 2025
[29]

(Security) assertions by large language models,

R. Kande, H. Pearce, B. Tan, B. Dolan-Gavitt, S. Thakur, R. Karri, and J. Rajendran, “(Security) assertions by large language models,”IEEE Trans. Inf. Forensics Secur., vol. 19, pp. 4374–4389, 2024

work page 2024
[30]

VeriGen: A large language model for Verilog code generation,

S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “VeriGen: A large language model for Verilog code generation,” ACM Trans. Design Autom. Electr. Syst., vol. 29, no. 3, pp. 46:1–46:31, 2024

work page 2024
[31]

v2c – A Verilog to C translator,

R. Mukherjee, M. Tautschnig, and D. Kroening, “v2c – A Verilog to C translator,” inTACAS, ser. LNCS, vol. 9636. Springer, 2016, pp. 580–586

work page 2016
[32]

E. M. Clarke, O. Grumberg, D. Kroening, D. A. Peled, and H. Veith, Model checking, 2nd Edition. MIT Press, 2018

work page 2018
[33]

Boolean satisfiability solvers and their applications in model checking,

Y . Vizel, G. Weissenbacher, and S. Malik, “Boolean satisfiability solvers and their applications in model checking,”Proc. IEEE, vol. 103, no. 11, pp. 2021–2035, 2015

work page 2021
[34]

Bit-level model checking,

A. Ivrii and Y . Vizel, “Bit-level model checking,” inHandbook of Computer Architecture, A. Chattopadhyay, Ed. Springer, 2022, pp. 1–40

work page 2022
[35]

Detecting hallucinations in large language models using semantic entropy,

S. Farquhar, J. Kossen, L. Kuhn, and Y . Gal, “Detecting hallucinations in large language models using semantic entropy,”Nat., vol. 630, no. 8017, pp. 625–630, 2024

work page 2024
[36]

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

J. Kossen, J. Han, M. Razzak, L. Schut, S. A. Malik, and Y . Gal, “Semantic entropy probes: Robust and cheap hallucination detection in LLMs,”CoRR, vol. abs/2406.15927, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inNeurIPS, 2022. [Online]. Available: http://papers.nips.cc/paper files/paper/2022/hash/ 9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

work page 2022
[38]

Sentence-BERT: Sentence embeddings using Siamese BERT-networks,

N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” inEMNLP-IJCNLP. Association for Computational Linguistics, 2019, pp. 3980–3990

work page 2019
[39]

Annotation inference for modular checkers,

C. Flanagan, R. Joshi, and K. R. M. Leino, “Annotation inference for modular checkers,”Inf. Process. Lett., vol. 77, no. 2-4, pp. 97–108, 2001

work page 2001
[40]

Small inductive safe invariants,

A. Ivrii, A. Gurfinkel, and A. Belov, “Small inductive safe invariants,” in 2014 Formal Methods in Computer-Aided Design (FMCAD), 2014, pp. 115–122

work page 2014
[41]

Hardware verification using software analyzers,

R. Mukherjee, D. Kroening, and T. Melham, “Hardware verification using software analyzers,” inVLSI. IEEE, 2015, pp. 7–12

work page 2015
[42]

Checking safety properties using induction and a SAT-solver,

M. Sheeran, S. Singh, and G. St ˚almarck, “Checking safety properties using induction and a SAT-solver,” inFMCAD, ser. LNCS, vol. 1954. Springer, 2000, pp. 108–125

work page 1954
[43]

Bounded model checking,

A. Biere, A. Cimatti, E. M. Clarke, O. Strichman, and Y . Zhu, “Bounded model checking,”Adv. Comput., vol. 58, pp. 117–148, 2003

work page 2003
[44]

VCEGAR: Verilog counterexample guided abstraction refinement,

H. Jain, D. Kroening, N. Sharygina, and E. M. Clarke, “VCEGAR: Verilog counterexample guided abstraction refinement,” inTACAS, ser. LNCS, vol

work page
[45]

Springer, 2007, pp. 583–586

work page 2007
[46]

VIS Verification Benchmarks,

F. Somenzi, “VIS Verification Benchmarks,” ftp://vlsi.colorado.edu/pub/ vis/vis-verilog-models-1.3.tar.gz

work page
[47]

Circuitcove.com, https://circuitcove.com/

work page
[48]

Loop invariant generation with LLMs

Microsoft. Loop invariant generation with LLMs. [Online]. Available: https://github.com/microsoft/loop-invariant-gen-experiments

work page
[49]

Hardware model checking competition 2024,

A. Biere, N. Froleyks, and M. Preiner, “Hardware model checking competition 2024,” inFormal Methods in Computer-Aided Design, FMCAD. IEEE, 2024, p. 1

work page 2024

[1] [1]

[Online]

deepmind.google. [Online]. Available: https://tinyurl.com/5n7abr7h

work page

[2] [2]

Advancing mathematics by guiding human intuition with AI,

A. Davies, P. Velickovic, L. Buesing, S. Blackwell, D. Zheng, N. Tomasev, R. Tanburn, P. W. Battaglia, C. Blundell, A. Juh ´asz, M. Lackenby, G. Williamson, D. Hassabis, and P. Kohli, “Advancing mathematics by guiding human intuition with AI,”Nat., vol. 600, no. 7887, pp. 70–74, 2021

work page 2021

[3] [3]

Discovering faster matrix multipli- cation algorithms with reinforcement learning,

A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz, D. Silver, D. Hassabis, and P. Kohli, “Discovering faster matrix multipli- cation algorithms with reinforcement learning,”Nat., vol. 610, no. 7930, pp. 47–53, 2022

work page 2022

[4] [4]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

A. Novikov, N. Vu, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog, “AlphaEvolve: A coding agent for scientific and algorithmic discovery,”CoRR, vol. abs/2506.13131, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Global Lyapunov functions: A long-standing open problem in mathematics, with symbolic transformers,

A. Alfarano, F. Charton, and A. Hayat, “Global Lyapunov functions: A long-standing open problem in mathematics, with symbolic transformers,” inNeurIPS, 2024

work page 2024

[6] [6]

Machine learning invariants of arithmetic curves,

Y . He, K. Lee, and T. Oliver, “Machine learning invariants of arithmetic curves,”J. Symb. Comput., vol. 115, pp. 478–491, 2023

work page 2023

[7] [7]

Predicting root numbers with neural networks,

A. Pozdnyakov, “Predicting root numbers with neural networks,”Int. J. Data Sci. Math. Sci., vol. 2, no. 1, pp. 15–37, 2024

work page 2024

[8] [8]

Learning Euler factors of elliptic curves,

A. Babei, F. Charton, E. Costa, X. Huang, K. Lee, D. Lowry-Duda, A. Narayanan, and A. Pozdnyakov, “Learning Euler factors of elliptic curves,”CoRR, vol. abs/2502.10357, 2025

work page arXiv 2025

[9] [9]

Mathematical discoveries from program search with large language models,

B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. R. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, P. Kohli, and A. Fawzi, “Mathematical discoveries from program search with large language models,”Nat., vol. 625, no. 7995, pp. 468–475, 2024

work page 2024

[10] [10]

Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,

A. Q. Jiang, S. Welleck, J. P. Zhou, T. Lacroix, J. Liu, W. Li, M. Jamnik, G. Lample, and Y . Wu, “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,” inICLR. OpenReview.net, 2023

work page 2023

[11] [11]

Lean-STaR: Learning to interleave thinking and proving,

H. Lin, Z. Sun, S. Welleck, and Y . Yang, “Lean-STaR: Learning to interleave thinking and proving,” inICLR. OpenReview.net, 2025

work page 2025

[12] [12]

arXiv preprint arXiv:2410.07432 , year =

L. Pan, V . Ganesh, J. D. Abernethy, C. Esposo, and W. Lee, “Can transformers reason logically? A study in SAT solving,”CoRR, vol. abs/2410.07432, 2024

work page arXiv 2024

[13] [13]

Purvis, B., Mao, Y., and Robinson, D

I. Petrov, J. Dekoninck, L. Baltadzhiev, M. Drencheva, K. Minchev, M. Balunovic, N. Jovanovic, and M. T. Vechev, “Proof or bluff? Evaluating LLMs on 2025 USA Math Olympiad,”CoRR, vol. abs/2503.21934, 2025

work page arXiv 2025

[14] [14]

Formal mathematical reasoning: A new frontier in ai.arXiv preprint 2412.16075, 2024

K. Yang, G. Poesia, J. He, W. Li, K. Lauter, S. Chaudhuri, and D. Song, “Formal mathematical reasoning: A new frontier in AI,”CoRR, vol. abs/2412.16075, 2024

work page arXiv 2024

[15] [15]

Towards reasoning in large language models: A survey,

J. Huang and K. C. Chang, “Towards reasoning in large language models: A survey,” inACL. Association for Computational Linguistics, 2023, pp. 1049–1065

work page 2023

[16] [16]

Transformers as soft reasoners over language,

P. Clark, O. Tafjord, and K. Richardson, “Transformers as soft reasoners over language,” inIJCAI. ijcai.org, 2020, pp. 3882–3890

work page 2020

[17] [17]

ProofWriter: Generating implications, proofs, and abductive statements over natural language,

O. Tafjord, B. Dalvi, and P. Clark, “ProofWriter: Generating implications, proofs, and abductive statements over natural language,” inACL/IJCNLP. Association for Computational Linguistics, 2021, pp. 3621–3634

work page 2021

[18] [18]

Learning to reason via mixture-of-thought for logical reasoning,

T. Zheng, L. Chen, S. Han, R. T. McCoy, and H. Huang, “Learning to reason via mixture-of-thought for logical reasoning,”CoRR, vol. abs/2505.15817, 2025

work page arXiv 2025

[19] [19]

Can ChatGPT support software verification?

C. Janßen, C. Richter, and H. Wehrheim, “Can ChatGPT support software verification?” inFASE, ser. LNCS, vol. 14573. Springer, 2024, pp. 266–279

work page 2024

[20] [20]

Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma

A. Kamath, A. Senthilnathan, S. Chakraborty, P. Deligiannis, S. K. Lahiri, A. Lal, A. Rastogi, S. Roy, and R. Sharma, “Finding inductive loop invariants using large language models,”CoRR, vol. abs/2311.07948, 2023

work page arXiv 2023

[21] [21]

Lemur: Integrating large language models in automated program verification,

H. Wu, C. W. Barrett, and N. Narodytska, “Lemur: Integrating large language models in automated program verification,” inICLR. OpenRe- view.net, 2024

work page 2024

[22] [22]

Leveraging large language models for automated proof synthesis in Rust,

J. Yao, Z. Zhou, W. Chen, and W. Cui, “Leveraging large language models for automated proof synthesis in Rust,”CoRR, vol. abs/2311.03739, 2023

work page arXiv 2023

[23] [23]

ClassInvGen: Class Invariant Synthesis using Large Language Models

C. Sun, V . Agashe, S. Chakraborty, J. Taneja, C. W. Barrett, D. L. Dill, X. Qiu, and S. K. Lahiri, “ClassInvGen: Class invariant synthesis using large language models,”CoRR, vol. abs/2502.18917, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2502.18917

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.18917 2025

[24] [24]

Scalable, validated code translation of entire projects using large language models,

H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, validated code translation of entire projects using large language models,”PLDI, pp. 1616–1641, 2025. [Online]. Available: https://doi.org/10.1145/3729315

work page doi:10.1145/3729315 2025

[25] [25]

VERT: verified equivalent rust transpilation with few- shot learning,

A. Z. H. Yang, Y . Takashima, B. Paulsen, J. Dodds, and D. Kroening, “VERT: verified equivalent Rust transpilation with few-shot learning,” CoRR, vol. abs/2404.18852, 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2404.18852

work page arXiv 2024

[26] [26]

Neural model checking,

M. Giacobbe, D. Kroening, A. Pal, and M. Tautschnig, “Neural model checking,” inNeurIPS, 2024

work page 2024

[27] [27]

Using LLMs to facilitate formal verification of RTL,

M. Orenes-Vera, M. Martonosi, and D. Wentzlaff, “Using LLMs to facilitate formal verification of RTL,”CoRR, vol. abs/2309.09437, 2023

work page arXiv 2023

[28] [28]

AssertLLM: Generating hardware verification assertions from design specifications via multi-LLMs,

Z. Yan, W. Fang, M. Li, M. Li, S. Liu, Z. Xie, and H. Zhang, “AssertLLM: Generating hardware verification assertions from design specifications via multi-LLMs,” inASPDAC. ACM, 2025, pp. 614–621

work page 2025

[29] [29]

(Security) assertions by large language models,

R. Kande, H. Pearce, B. Tan, B. Dolan-Gavitt, S. Thakur, R. Karri, and J. Rajendran, “(Security) assertions by large language models,”IEEE Trans. Inf. Forensics Secur., vol. 19, pp. 4374–4389, 2024

work page 2024

[30] [30]

VeriGen: A large language model for Verilog code generation,

S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “VeriGen: A large language model for Verilog code generation,” ACM Trans. Design Autom. Electr. Syst., vol. 29, no. 3, pp. 46:1–46:31, 2024

work page 2024

[31] [31]

v2c – A Verilog to C translator,

R. Mukherjee, M. Tautschnig, and D. Kroening, “v2c – A Verilog to C translator,” inTACAS, ser. LNCS, vol. 9636. Springer, 2016, pp. 580–586

work page 2016

[32] [32]

E. M. Clarke, O. Grumberg, D. Kroening, D. A. Peled, and H. Veith, Model checking, 2nd Edition. MIT Press, 2018

work page 2018

[33] [33]

Boolean satisfiability solvers and their applications in model checking,

Y . Vizel, G. Weissenbacher, and S. Malik, “Boolean satisfiability solvers and their applications in model checking,”Proc. IEEE, vol. 103, no. 11, pp. 2021–2035, 2015

work page 2021

[34] [34]

Bit-level model checking,

A. Ivrii and Y . Vizel, “Bit-level model checking,” inHandbook of Computer Architecture, A. Chattopadhyay, Ed. Springer, 2022, pp. 1–40

work page 2022

[35] [35]

Detecting hallucinations in large language models using semantic entropy,

S. Farquhar, J. Kossen, L. Kuhn, and Y . Gal, “Detecting hallucinations in large language models using semantic entropy,”Nat., vol. 630, no. 8017, pp. 625–630, 2024

work page 2024

[36] [36]

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

J. Kossen, J. Han, M. Razzak, L. Schut, S. A. Malik, and Y . Gal, “Semantic entropy probes: Robust and cheap hallucination detection in LLMs,”CoRR, vol. abs/2406.15927, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inNeurIPS, 2022. [Online]. Available: http://papers.nips.cc/paper files/paper/2022/hash/ 9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

work page 2022

[38] [38]

Sentence-BERT: Sentence embeddings using Siamese BERT-networks,

N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” inEMNLP-IJCNLP. Association for Computational Linguistics, 2019, pp. 3980–3990

work page 2019

[39] [39]

Annotation inference for modular checkers,

C. Flanagan, R. Joshi, and K. R. M. Leino, “Annotation inference for modular checkers,”Inf. Process. Lett., vol. 77, no. 2-4, pp. 97–108, 2001

work page 2001

[40] [40]

Small inductive safe invariants,

A. Ivrii, A. Gurfinkel, and A. Belov, “Small inductive safe invariants,” in 2014 Formal Methods in Computer-Aided Design (FMCAD), 2014, pp. 115–122

work page 2014

[41] [41]

Hardware verification using software analyzers,

R. Mukherjee, D. Kroening, and T. Melham, “Hardware verification using software analyzers,” inVLSI. IEEE, 2015, pp. 7–12

work page 2015

[42] [42]

Checking safety properties using induction and a SAT-solver,

M. Sheeran, S. Singh, and G. St ˚almarck, “Checking safety properties using induction and a SAT-solver,” inFMCAD, ser. LNCS, vol. 1954. Springer, 2000, pp. 108–125

work page 1954

[43] [43]

Bounded model checking,

A. Biere, A. Cimatti, E. M. Clarke, O. Strichman, and Y . Zhu, “Bounded model checking,”Adv. Comput., vol. 58, pp. 117–148, 2003

work page 2003

[44] [44]

VCEGAR: Verilog counterexample guided abstraction refinement,

H. Jain, D. Kroening, N. Sharygina, and E. M. Clarke, “VCEGAR: Verilog counterexample guided abstraction refinement,” inTACAS, ser. LNCS, vol

work page

[45] [45]

Springer, 2007, pp. 583–586

work page 2007

[46] [46]

VIS Verification Benchmarks,

F. Somenzi, “VIS Verification Benchmarks,” ftp://vlsi.colorado.edu/pub/ vis/vis-verilog-models-1.3.tar.gz

work page

[47] [47]

Circuitcove.com, https://circuitcove.com/

work page

[48] [48]

Loop invariant generation with LLMs

Microsoft. Loop invariant generation with LLMs. [Online]. Available: https://github.com/microsoft/loop-invariant-gen-experiments

work page

[49] [49]

Hardware model checking competition 2024,

A. Biere, N. Froleyks, and M. Preiner, “Hardware model checking competition 2024,” inFormal Methods in Computer-Aided Design, FMCAD. IEEE, 2024, p. 1

work page 2024