Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification
Pith reviewed 2026-05-23 18:08 UTC · model grok-4.3
The pith
Cobblestone automates Coq proof synthesis by using LLMs to decompose goals into subgoals, automatically isolating successful subproofs, and iterating until a complete sound proof is built.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cobblestone decomposes proof synthesis into an iterative loop: an LLM generates a candidate proof script that breaks the current goal into subgoals; the system extracts and verifies any subgoals that were successfully discharged; the remaining open subgoals become the new targets; and the loop continues until every subgoal is closed, at which point the assembled script constitutes a machine-checked, sound Coq proof.
What carries the argument
The divide-and-conquer iteration that uses LLM-generated scripts to split goals, automatically identifies proven subgoals, and recurses on the unsolved remainder.
If this is right
- Fully automatic runs outperform state-of-the-art non-LLM tools on four open-source Coq benchmarks.
- The method proves theorems that other LLM-based tools leave unproven on multiple benchmarks.
- On several benchmarks it also exceeds the success rates of other LLM tools.
- Average cost is $1.25 and average runtime is 14.7 minutes per theorem.
- When given external input such as a proof skeleton or relevant lemmas, success reaches up to 58 percent.
Where Pith is reading between the lines
- The same decomposition loop could be ported to other interactive theorem provers that expose subgoal structure.
- Pairing the method with stronger future language models would raise the fraction of theorems solved before any external oracle is needed.
- User interfaces that let developers supply high-level lemmas or proof outlines could turn Cobblestone into a practical assistant rather than a fully automatic tool.
- If the iteration converges reliably, the technique reduces the manual effort required to verify properties of real software libraries.
Load-bearing premise
LLM outputs can be parsed into subgoals whose successful proofs are reliably detected so that repeated decomposition eventually yields a complete, checkable proof.
What would settle it
A fresh benchmark of Coq theorems drawn from projects absent from training data where the fraction of theorems proved by Cobblestone falls below the best non-LLM baseline or the best prior LLM baseline.
read the original abstract
Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but requires significant effort and expertise. Machine learning can automatically synthesize proofs, but such tools are able to prove only a fraction of desired software properties. We introduce Cobblestone, a divide-and-conquer approach for proof synthesis. Cobblestone uses a large language model (LLM) to generate potential proofs, uses those proofs to break the problem into simpler parts, automatically identifies which of those parts were successfully proven, and iterates on the remaining parts to build a correct proof that is guaranteed to be sound, despite the reliance on unsound LLMs. We evaluate Cobblestone on four benchmarks of open-source Coq projects, controlling for training data leakage. Fully automatically, Cobblestone outperforms state-of-the-art non-LLM tools, and proves many theorems that other LLM-based tools cannot, and on many benchmarks, outperforms them. Each Cobblestone run costs only $1.25 and takes 14.7 minutes, on average. Cobblestone can also be used with external input, from a user or another tool, providing a proof structure or relevant lemmas. Evaluated with such an oracle, Cobblestone proves up to 58% of theorems. Overall, our research shows that tools can make use of partial progress and external input to more effectively automate formal verification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Cobblestone, a divide-and-conquer method for proof synthesis in Coq using large language models (LLMs). The approach generates potential proofs with an LLM to break down problems, identifies successful subproofs automatically, and iterates on remaining parts to construct a sound proof verified by the proof assistant. It reports outperforming state-of-the-art non-LLM tools on four benchmarks of open-source Coq projects (with training data leakage controls), proving theorems that other LLM-based tools cannot, at an average cost of $1.25 and 14.7 minutes per run. With external oracle input, it achieves up to 58% theorem proving success.
Significance. If the empirical claims hold under detailed scrutiny, this work would be significant for the formal verification community. It demonstrates a practical way to leverage the strengths of LLMs for proof generation while maintaining soundness guarantees through the proof assistant, addressing a key limitation of pure LLM-based approaches. The reported outperformance on benchmarks, low computational cost, and ability to incorporate external input suggest potential for broader adoption in automating parts of formal verification tasks.
minor comments (1)
- [Abstract] The abstract provides a high-level overview but lacks specific details on the benchmarks used, the exact baselines compared against, or the statistical significance of the performance improvements.
Simulated Author's Rebuttal
We thank the referee for their summary of our manuscript and for recognizing its potential significance to the formal verification community. The report lists no specific major comments, so we have no individual points to address. We are prepared to provide clarifications or additional experiments if the referee has further questions.
Circularity Check
No significant circularity identified
full rationale
The paper is an empirical systems paper describing a divide-and-conquer proof synthesis tool. Its central claims concern measured performance on external Coq benchmarks and a soundness guarantee supplied by the proof assistant rather than the LLM. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract. The described workflow (LLM suggestions decomposed and verified by the assistant) is externally falsifiable against independent benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLM-generated partial proofs permit decomposition into subgoals that can be automatically verified as successful or not
- domain assumption The iteration process can reach a complete sound proof from partial progress without additional unstated mechanisms
Forward citations
Cited by 7 Pith papers
-
A Learning Method for Symbolic Systems Using Large Language Models
LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.
-
Certified Program Synthesis with a Multi-Modal Verifier
LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, out...
-
The Search for Constrained Random Generators
A Lean library called Palamedes uses synthesis rules from generator semantics and catamorphism-anamorphism rewriting to automatically produce correct constrained random generators.
-
AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
AutoSOUP automates component-level memory-safety verification by generating Safety-Oriented Unit Proofs via three techniques and a hybrid LLM-plus-program-synthesis architecture called LLM-As-Function-Call.
-
PROMISE: Proof Automation as Structural Imitation of Human Reasoning
PROMISE reframes automated proof generation as stateful search over structural embeddings of proof states, outperforming prior LLM-based systems by up to 26 points on the seL4 benchmark.
-
Understanding and Improving Automated Proof Synthesis for Interactive Theorem Provers
A pattern-guided tactic search method improves automated proof synthesis success rates by an average of 8.05% and achieves a 20% increase on previously unproven theorems.
-
On Reasoning-Centric LLM-based Automated Theorem Proving
ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.