Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs

Guchan Li; Hongning Wang; Rui Tian

arxiv: 2604.18587 · v2 · pith:FEFCKLZGnew · submitted 2026-03-13 · 💻 cs.LG · cs.AI· cs.LO· cs.PL

Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs

Guchan Li , Rui Tian , Hongning Wang This is my paper

Pith reviewed 2026-05-15 12:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LOcs.PL

keywords formal theorem provingcompiler feedbacklearning-to-refinePutnamBenchlarge language modelstree searchverifier-guided reasoningproof compression

0 comments

The pith

Compiler outputs compress diverse proof attempts into a compact set of failure modes that support local corrections and stronger theorem proving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language-model provers can be strengthened by exploiting how compilers turn many different failed proof attempts into a small collection of clear, structured error signals. This compression supports a tree-search process that fixes mistakes locally using direct verifier feedback instead of tracking long sequences of prior attempts. The result is more efficient exploration that scales across model sizes while keeping test-time compute comparable to baselines. A sympathetic reader would care because current top provers often demand prohibitive roll-outs or context lengths, and this structure offers a path to better performance without those costs.

Core claim

Compilers map a broad space of incorrect proof attempts to a compact set of structured failure modes. The authors build a learning-to-refine framework that performs tree search and corrects errors locally conditioned on explicit verifier feedback from those modes, avoiding the need to accumulate long histories of attempts. This approach consistently amplifies base provers of varying scales and reaches state-of-the-art performance on PutnamBench among publicly reported models of roughly 8B and 32B parameters under matched test-time budgets.

What carries the argument

Tree search that performs local error correction conditioned on compact compiler-derived failure modes.

If this is right

Base provers at different scales gain improved reasoning capabilities.
Proof exploration proceeds efficiently without maintaining long histories of attempts.
Performance improvements hold consistently across model sizes.
State-of-the-art results appear on PutnamBench for 8B and 32B models at comparable compute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compression of verifier signals could apply to other structured reasoning tasks such as code synthesis where failure modes are also limited.
One could test whether predicting the compact failure modes ahead of time reduces search cost further.
Combining the local-correction tree search with broader exploration strategies might yield additional gains on harder benchmarks.

Load-bearing premise

Compilers reliably reduce many different proof attempts to a small number of structured failure modes that remain useful for guiding corrections.

What would settle it

Evaluating the method on PutnamBench under the stated test-time budgets and observing that it fails to match or exceed the success rates of comparable baselines.

Figures

Figures reproduced from arXiv: 2604.18587 by Guchan Li, Hongning Wang, Rui Tian.

**Figure 1.** Figure 1: Overview of the proposed Learning-to-Refine framework. Compiler messages map the space of diverse proof attempts to a compact set of structured failure modes, which enable efficient learning and search for the correct proofs. We trained a neural value model to efficiently guide test-time proof generation and refinement process. mal reasoning capability into a generalizable form of policy that can both gene… view at source ↗

**Figure 2.** Figure 2: Refinement training data synthesis pipeline. We use Kimina Lean Server (Santos et al., 2025) for fast program verification. are reported in Appendix B.2. Our proposed data synthesis pipeline is hallucination-free and cost-effective. All corrected solutions are validated by the Lean compiler, and the refinement data is derived directly from the failures of the prover itself. Fine-tuning the prover on this … view at source ↗

**Figure 3.** Figure 3: Visualization of programs generated by augmented Kimina-8B prover on problem Putnam-1965-b5. Each gray circle represents a failed Lean proof in program space. Green circle corresponds to a correct solution. The successful search trajectory is highlighted in orange. an intuitive geometric perspective on this separation, we further visualize the sampled Lean programs using multidimensional scaling (MDS) (Mea… view at source ↗

**Figure 4.** Figure 4: Visualization of random tree search strategy and value-guided tree search on problem Putnam-1971-b1 under augmented Goedel-32B prover. Figure 4b visualizes the values for each tree node using colormap. Red nodes correspond to broken nodes (where LLM’s proof generation process failed). Green nodes represent correct solutions. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗

read the original abstract

Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported $\sim$8B and $\sim$32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper uses compiler failure modes to drive local corrections in tree search and claims this lets mid-sized models hit SOTA on PutnamBench without the usual rollout costs.

read the letter

The main takeaway is that they turn the compiler into a compressor: instead of letting the model keep long histories of failed proof attempts, they extract structured error signals and use those to guide local fixes during search. This learning-to-refine loop is the concrete new piece. It differs from plain massive sampling or context stuffing by conditioning corrections directly on verifier output at each node. They show gains on PutnamBench for both the 8B and 32B models while claiming comparable test-time budgets, which is the result that would matter to people already running verifier-guided provers. The experiments appear to test across model scales and report consistent amplification of the base models, so the method is not just a one-off trick on a single size. That part is worth noting because it gives a practical lever for efficiency. The soft spot is the budget accounting. The stress-test concern is fair: if every node triggers a compiler call, the total cost could exceed what the baselines actually spent, and the paper needs to show wall-clock or token breakdowns that include those calls. Without that, the performance delta is harder to attribute cleanly to the compression idea. The assumption that failure modes stay compact and reusable across theorems also needs more evidence; if it only holds on PutnamBench, the generality claim weakens. This is for groups already working on LLM theorem provers who care about search efficiency rather than raw scale. It deserves a serious referee because the core mechanism is simple enough to test and the reported gains are large enough to check. I would send it to review rather than desk reject, with the expectation that the authors tighten the compute comparison in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a 'Compile to Compress' framework for LLM-based formal theorem proving. It observes that compilers compress diverse proof attempts into a compact set of structured failure modes and uses this to drive a tree-search procedure that performs local corrections conditioned on explicit verifier feedback, avoiding accumulation of long proof histories. The central empirical claim is state-of-the-art performance on PutnamBench among publicly reported ~8B and ~32B models under comparable test-time budgets.

Significance. If the budget-equivalence claim is substantiated, the work offers a concrete mechanism for improving verifier-guided reasoning efficiency by exploiting the natural compression performed by compilers. This could reduce the need for massive roll-outs or extended contexts and provides a reproducible direction for scaling formal provers that is grounded in the structure of the verification oracle rather than purely heuristic search.

major comments (2)

[Abstract / Experimental section] Abstract and experimental evaluation: the SOTA claim on PutnamBench for ~8B and ~32B models rests on the assertion of 'comparable test-time budgets.' Because the proposed tree search invokes compiler calls at every node, the manuscript must explicitly demonstrate (via wall-clock or token accounting) that these costs are included in the reported budgets and that the baselines received equivalent verifier feedback; without this accounting the performance delta cannot be attributed to the compile-to-compress mechanism.
[Method description] The central modeling assumption that compilers map a vast space of proof attempts to a compact set of failure modes is stated qualitatively but not quantified. A table or figure reporting the observed number of unique failure modes, their frequency distribution, and how this compactness is leveraged in the search policy would be required to substantiate the efficiency argument.

minor comments (2)

[Method] Notation for the tree-search procedure and the conditioning on failure modes should be introduced with a single consistent diagram or pseudocode block rather than scattered textual descriptions.
[Abstract / Experiments] The abstract refers to 'extensive evaluations' and 'varying scales' but provides no table or section reference; a summary table of model sizes, baselines, and key metrics should appear early in the experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of substantiating our efficiency claims and modeling assumptions. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract / Experimental section] Abstract and experimental evaluation: the SOTA claim on PutnamBench for ~8B and ~32B models rests on the assertion of 'comparable test-time budgets.' Because the proposed tree search invokes compiler calls at every node, the manuscript must explicitly demonstrate (via wall-clock or token accounting) that these costs are included in the reported budgets and that the baselines received equivalent verifier feedback; without this accounting the performance delta cannot be attributed to the compile-to-compress mechanism.

Authors: We agree that explicit cost accounting is necessary to support the budget-equivalence claim. In the revised manuscript we will add a new subsection (and accompanying table) in the experimental evaluation that reports wall-clock time and token usage for the full tree-search procedure, explicitly breaking out the compiler invocation costs at every node. We will also document the exact verifier-feedback budget allocated to each baseline so that the performance differences can be attributed to the compile-to-compress mechanism rather than unequal oracle access. revision: yes
Referee: [Method description] The central modeling assumption that compilers map a vast space of proof attempts to a compact set of failure modes is stated qualitatively but not quantified. A table or figure reporting the observed number of unique failure modes, their frequency distribution, and how this compactness is leveraged in the search policy would be required to substantiate the efficiency argument.

Authors: We acknowledge that a quantitative characterization would strengthen the central modeling claim. The revised paper will include a new figure and table that report (i) the number of unique failure modes observed across sampled proof attempts on PutnamBench, (ii) their empirical frequency distribution, and (iii) an analysis of how the observed compactness is exploited by the local-refinement policy (i.e., conditioning corrections only on the failure mode rather than full proof histories). This addition will provide direct empirical support for the efficiency argument. revision: yes

Circularity Check

0 steps flagged

No circularity: method uses external compiler feedback as independent signal

full rationale

The paper's core claim rests on an empirical observation that compilers compress diverse proof attempts into structured failure modes, followed by a tree-search refinement procedure conditioned on verifier outputs. This structure is not self-defined or fitted by construction; the compiler is an external oracle whose outputs are not derived from the model's parameters or predictions. No equations reduce a 'prediction' to a fitted input, no uniqueness theorem is imported from self-citation, and the SOTA result on PutnamBench is presented as an empirical outcome under stated test-time budgets rather than a definitional equivalence. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the framework builds on standard compiler outputs and LLM tree search without introducing new postulated objects.

pith-pipeline@v0.9.0 · 5471 in / 1045 out tokens · 39796 ms · 2026-05-15T12:38:53.288785+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

failure-mode driven refinement

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.