Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification

Arpan Agarwal; Emily First; Saketh Ram Kasibatla; Sorin Lerner; Talia Ringer; Yuriy Brun

arxiv: 2410.19940 · v4 · submitted 2024-10-25 · 💻 cs.LO · cs.AI· cs.PL

Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification

Saketh Ram Kasibatla , Arpan Agarwal , Yuriy Brun , Sorin Lerner , Talia Ringer , Emily First This is my paper

Pith reviewed 2026-05-23 18:08 UTC · model grok-4.3

classification 💻 cs.LO cs.AIcs.PL

keywords formal verificationCoqproof synthesislarge language modelsdivide and conquerautomated theorem provingsoundnessinteractive theorem provers

0 comments

The pith

Cobblestone automates Coq proof synthesis by using LLMs to decompose goals into subgoals, automatically isolating successful subproofs, and iterating until a complete sound proof is built.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Cobblestone as a divide-and-conquer method that calls on a large language model to propose candidate proofs, which are then used to split the original theorem into smaller subgoals. Successful subproofs are detected automatically from the model output, and the process repeats on whatever remains until a full proof is assembled. Because the final proof is checked by the Coq kernel, soundness holds even when the model itself is unreliable. The approach runs fully automatically on open-source Coq projects and records higher success rates than prior non-LLM and LLM-based tools while costing about one dollar and fifteen minutes per theorem on average. When supplied with external structure or lemmas, success reaches 58 percent.

Core claim

Cobblestone decomposes proof synthesis into an iterative loop: an LLM generates a candidate proof script that breaks the current goal into subgoals; the system extracts and verifies any subgoals that were successfully discharged; the remaining open subgoals become the new targets; and the loop continues until every subgoal is closed, at which point the assembled script constitutes a machine-checked, sound Coq proof.

What carries the argument

The divide-and-conquer iteration that uses LLM-generated scripts to split goals, automatically identifies proven subgoals, and recurses on the unsolved remainder.

If this is right

Fully automatic runs outperform state-of-the-art non-LLM tools on four open-source Coq benchmarks.
The method proves theorems that other LLM-based tools leave unproven on multiple benchmarks.
On several benchmarks it also exceeds the success rates of other LLM tools.
Average cost is $1.25 and average runtime is 14.7 minutes per theorem.
When given external input such as a proof skeleton or relevant lemmas, success reaches up to 58 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition loop could be ported to other interactive theorem provers that expose subgoal structure.
Pairing the method with stronger future language models would raise the fraction of theorems solved before any external oracle is needed.
User interfaces that let developers supply high-level lemmas or proof outlines could turn Cobblestone into a practical assistant rather than a fully automatic tool.
If the iteration converges reliably, the technique reduces the manual effort required to verify properties of real software libraries.

Load-bearing premise

LLM outputs can be parsed into subgoals whose successful proofs are reliably detected so that repeated decomposition eventually yields a complete, checkable proof.

What would settle it

A fresh benchmark of Coq theorems drawn from projects absent from training data where the fraction of theorems proved by Cobblestone falls below the best non-LLM baseline or the best prior LLM baseline.

read the original abstract

Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but requires significant effort and expertise. Machine learning can automatically synthesize proofs, but such tools are able to prove only a fraction of desired software properties. We introduce Cobblestone, a divide-and-conquer approach for proof synthesis. Cobblestone uses a large language model (LLM) to generate potential proofs, uses those proofs to break the problem into simpler parts, automatically identifies which of those parts were successfully proven, and iterates on the remaining parts to build a correct proof that is guaranteed to be sound, despite the reliance on unsound LLMs. We evaluate Cobblestone on four benchmarks of open-source Coq projects, controlling for training data leakage. Fully automatically, Cobblestone outperforms state-of-the-art non-LLM tools, and proves many theorems that other LLM-based tools cannot, and on many benchmarks, outperforms them. Each Cobblestone run costs only $1.25 and takes 14.7 minutes, on average. Cobblestone can also be used with external input, from a user or another tool, providing a proof structure or relevant lemmas. Evaluated with such an oracle, Cobblestone proves up to 58% of theorems. Overall, our research shows that tools can make use of partial progress and external input to more effectively automate formal verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Cobblestone's divide-and-conquer loop for turning partial LLM outputs into sound Coq proofs is the main new piece, with reported gains over non-LLM baselines on four benchmarks.

read the letter

The key point is that Cobblestone decomposes LLM-generated proof attempts into subgoals, keeps only the ones the proof assistant accepts, and iterates on the rest. This keeps the final proof sound even though the LLM itself is not. The abstract positions this as distinct from prior direct synthesis work because it explicitly reuses partial successes and can take external structure or lemmas as input. On the four Coq benchmarks with leakage controls, it claims to beat existing non-LLM tools outright and to succeed on theorems that other LLM tools miss, all at roughly $1.25 and 15 minutes per run on average. With an oracle it reaches 58 percent. Those numbers are specific enough to be checked, and the soundness argument follows directly from routing everything through the assistant's checker. The approach therefore makes practical use of whatever partial progress the model can deliver without introducing false proofs. The main limitation right now is that only the abstract is available, so the exact decomposition rules, how subgoals are extracted and matched, and the full experimental tables are not visible. That leaves open questions about how often the iteration actually terminates, whether the automatic success filter has false positives, and how sensitive the gains are to the choice of LLM or benchmark split. The central performance claims rest on those unreported details. The paper targets people already working on Coq automation and LLM proof tools. Anyone tracking practical progress in that niche would want to see the full evaluation and method description. It is coherent on its own terms and makes falsifiable claims against existing systems, so it should go to peer review rather than a desk reject.

Referee Report

0 major / 1 minor

Summary. The manuscript introduces Cobblestone, a divide-and-conquer method for proof synthesis in Coq using large language models (LLMs). The approach generates potential proofs with an LLM to break down problems, identifies successful subproofs automatically, and iterates on remaining parts to construct a sound proof verified by the proof assistant. It reports outperforming state-of-the-art non-LLM tools on four benchmarks of open-source Coq projects (with training data leakage controls), proving theorems that other LLM-based tools cannot, at an average cost of $1.25 and 14.7 minutes per run. With external oracle input, it achieves up to 58% theorem proving success.

Significance. If the empirical claims hold under detailed scrutiny, this work would be significant for the formal verification community. It demonstrates a practical way to leverage the strengths of LLMs for proof generation while maintaining soundness guarantees through the proof assistant, addressing a key limitation of pure LLM-based approaches. The reported outperformance on benchmarks, low computational cost, and ability to incorporate external input suggest potential for broader adoption in automating parts of formal verification tasks.

minor comments (1)

[Abstract] The abstract provides a high-level overview but lacks specific details on the benchmarks used, the exact baselines compared against, or the statistical significance of the performance improvements.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of our manuscript and for recognizing its potential significance to the formal verification community. The report lists no specific major comments, so we have no individual points to address. We are prepared to provide clarifications or additional experiments if the referee has further questions.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is an empirical systems paper describing a divide-and-conquer proof synthesis tool. Its central claims concern measured performance on external Coq benchmarks and a soundness guarantee supplied by the proof assistant rather than the LLM. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract. The described workflow (LLM suggestions decomposed and verified by the assistant) is externally falsifiable against independent benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review provides limited visibility into assumptions; the approach rests on domain assumptions about LLM utility for partial proofs and automated subgoal identification.

axioms (2)

domain assumption LLM-generated partial proofs permit decomposition into subgoals that can be automatically verified as successful or not
Central to the divide-and-conquer iteration described in the abstract.
domain assumption The iteration process can reach a complete sound proof from partial progress without additional unstated mechanisms
Required for the guarantee of soundness and the reported success rates.

pith-pipeline@v0.9.0 · 5764 in / 1446 out tokens · 59726 ms · 2026-05-23T18:08:27.544827+00:00 · methodology

discussion (0)

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Learning Method for Symbolic Systems Using Large Language Models
cs.SE 2026-05 unverdicted novelty 7.0

LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.
Certified Program Synthesis with a Multi-Modal Verifier
cs.SE 2026-04 unverdicted novelty 7.0

LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, out...
The Search for Constrained Random Generators
cs.PL 2025-11 unverdicted novelty 7.0

A Lean library called Palamedes uses synthesis rules from generator semantics and catamorphism-anamorphism rewriting to automatically produce correct constrained random generators.
AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
cs.SE 2026-05 unverdicted novelty 6.0

AutoSOUP automates component-level memory-safety verification by generating Safety-Oriented Unit Proofs via three techniques and a hybrid LLM-plus-program-synthesis architecture called LLM-As-Function-Call.
PROMISE: Proof Automation as Structural Imitation of Human Reasoning
cs.LO 2026-04 unverdicted novelty 6.0

PROMISE reframes automated proof generation as stateful search over structural embeddings of proof states, outperforming prior LLM-based systems by up to 26 points on the seL4 benchmark.
Understanding and Improving Automated Proof Synthesis for Interactive Theorem Provers
cs.LO 2026-04 unverdicted novelty 5.0

A pattern-guided tactic search method improves automated proof synthesis success rates by an average of 8.05% and achieves a 20% increase on previously unproven theorems.
On Reasoning-Centric LLM-based Automated Theorem Proving
cs.SE 2026-04 unverdicted novelty 5.0

ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.