Exponential Sample Complexity Separation between Flat and Hierarchical Agentic Theorem Provers

Sho Sonoda; Shunta Akiyama; Yuya Uezato

arxiv: 2602.10512 · v2 · submitted 2026-02-11 · 💻 cs.LG · cs.LO· stat.ML

Exponential Sample Complexity Separation between Flat and Hierarchical Agentic Theorem Provers

Sho Sonoda , Shunta Akiyama , Yuya Uezato This is my paper

Pith reviewed 2026-05-16 02:24 UTC · model grok-4.3

classification 💻 cs.LG cs.LOstat.ML

keywords theorem provingimitation learninghierarchical agentssample complexityproof searchMDP modelingagentic systems

0 comments

The pith

Hierarchical agentic theorem provers achieve exponentially lower sample complexity than flat ones when proofs contain reusable substructures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models proof search as a deterministic finite-horizon MDP and derives sample complexity bounds for offline imitation learning from teacher proof traces. It shows that a hierarchical learner, which represents proofs as reusable DAGs, avoids the exponential duplication of hard local arguments that a flat learner encounters in inlined traces. This leads to exponentially smaller sufficient-sample certificates for the hierarchical approach. A reader would care because it explains statistically why introducing intermediate lemmas can help rather than hinder learned provers in domains with repeated subproofs.

Core claim

When flattening duplicates the same hard local argument exponentially many times, the sufficient-sample certificate produced by our bounds can be exponentially smaller for the hierarchical learner that predicts a reusable proof DAG and solves each shared block once.

What carries the argument

The sufficient-sample certificate from imitation learning bounds depending on average teacher proof length, local action predictability, and student accuracy.

If this is right

Flat learners suffer exponential blowup in training data size due to repeated subproofs.
Hierarchical learners solve shared blocks once, keeping sample needs linear in unique subproblems.
The separation is largest when the same difficult local prediction appears at many depths in the proof tree.
Verifier-based theorem proving benefits from structure that allows reuse in the learned policy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If proof datasets have high reuse, hierarchical methods should show clear sample efficiency gains in experiments.
This framework could be extended to compare online reinforcement learning versions of flat and hierarchical provers.
Domains like algebraic geometry with many similar lemmas might exhibit the largest practical benefits.

Load-bearing premise

The training data consists of successful verified proof traces with sufficiently predictable local actions and reusable structure.

What would settle it

A dataset of theorem proofs where subproofs repeat exponentially but the empirical sample complexity for flat and hierarchical learners shows no exponential gap.

read the original abstract

Agentic theorem provers often introduce intermediate lemmas, proof sketches, or subgoal decompositions before returning to tactic-level search. This can look like an expensive detour: if proving lemmas is itself hard, why should a learned prover spend effort there? We give a statistical learning answer. Instead of worst-case proof complexity over all formulas, we study the biased data distribution produced by a teacher prover: initial theorem states together with successful verified proof traces. We model proof search as a deterministic finite-horizon MDP and analyze offline imitation learning from those traces. The success bounds depend on the average length of teacher proofs, how predictable the teacher's next action is, and how accurately the student learns that local prediction problem. A flat student learns from fully inlined traces, so repeated subproofs appear many times in its training and test-time certificate. A hierarchical student instead predicts a reusable proof DAG and solves each shared block once. When flattening duplicates the same hard local argument exponentially many times, the sufficient-sample certificate produced by our bounds can be exponentially smaller for the hierarchical learner. This gives a concrete statistical mechanism by which reusable proof structure helps verifier-based theorem proving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper derives an exponential sample-complexity gap favoring hierarchical over flat imitation learners on teacher proof traces, via standard bounds on a deterministic MDP model.

read the letter

The main takeaway is that hierarchical structure yields an exponential reduction in the samples needed for reliable imitation learning when teacher traces contain repeated subproofs. Flat learners see those blocks duplicated in long trajectories, which inflates the per-step accuracy requirement and therefore the total sample count; hierarchical learners collapse the shared blocks and avoid the blow-up. The argument follows from applying existing offline imitation-learning bounds to the two representations of the same teacher data.

Referee Report

2 major / 2 minor

Summary. The paper models proof search as a deterministic finite-horizon MDP whose expert trajectories are successful verified proof traces from a teacher prover. It applies standard offline imitation-learning sample-complexity bounds to these traces and shows that a flat learner, which trains on fully inlined trajectories, requires exponentially more samples than a hierarchical learner that predicts a reusable proof DAG and solves each shared block once, because flattening duplicates hard local subproofs many times and thereby inflates the required per-step accuracy to keep total failure probability below a fixed threshold.

Significance. If the exponential separation holds under the stated modeling assumptions, the result supplies a concrete statistical mechanism—rooted in trajectory length and local predictability—by which reusable hierarchical structure improves sample efficiency for verifier-based theorem proving. The derivation re-uses existing imitation-learning bounds rather than introducing new parameters, yielding falsifiable predictions that depend only on average proof length and the teacher’s local action predictability.

major comments (2)

§3 (MDP modeling) and the derivation leading to the main separation bound: the exponential gap is obtained by letting the flat trajectory length grow linearly with the duplication factor while the hierarchical version collapses shared blocks; the paper must exhibit the explicit dependence of the per-step error tolerance on this length (including all constants from the underlying imitation-learning theorem) to confirm that the gap is exponential rather than merely polynomial.
The paragraph following the abstract’s description of the bounds: the claim that the hierarchical certificate is “exponentially smaller” rests on the assumption that local actions remain sufficiently predictable across duplicated subproofs; an explicit statement of the minimal predictability threshold (or a counter-example when predictability drops) is needed because this is load-bearing for the separation to materialize outside the deterministic-MDP idealization.

minor comments (2)

Notation section: the symbols for the teacher distribution, the local prediction problem, and the flat versus hierarchical trajectory lengths should be introduced with a single consolidated table or equation block to avoid repeated re-definition when the two learners are compared.
Figure 1 (or the illustrative proof-DAG diagram): the caption should explicitly label which nodes are collapsed in the hierarchical case and which remain duplicated in the flat case, so that the length difference is immediately visible without cross-referencing the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We agree that making the dependence on trajectory length and constants explicit, as well as clarifying the predictability threshold, will strengthen the paper. We address each major comment below and will incorporate the requested changes in the revised version.

read point-by-point responses

Referee: §3 (MDP modeling) and the derivation leading to the main separation bound: the exponential gap is obtained by letting the flat trajectory length grow linearly with the duplication factor while the hierarchical version collapses shared blocks; the paper must exhibit the explicit dependence of the per-step error tolerance on this length (including all constants from the underlying imitation-learning theorem) to confirm that the gap is exponential rather than merely polynomial.

Authors: We agree that the current presentation would benefit from a fully expanded derivation. In the revision we will restate the underlying offline imitation-learning theorem (including all logarithmic factors and constants) in §3 and explicitly derive the per-step tolerance ε(L, δ) = Θ(δ / L) required to keep total failure probability below δ. Substituting the flat length L_flat = Θ(d · L_hier) for duplication factor d then yields an exponential gap in the sample-complexity certificate, which we will display side-by-side with the hierarchical bound. revision: yes
Referee: The paragraph following the abstract’s description of the bounds: the claim that the hierarchical certificate is “exponentially smaller” rests on the assumption that local actions remain sufficiently predictable across duplicated subproofs; an explicit statement of the minimal predictability threshold (or a counter-example when predictability drops) is needed because this is load-bearing for the separation to materialize outside the deterministic-MDP idealization.

Authors: We will add an explicit statement immediately after the abstract paragraph (and again in §3) giving the minimal local predictability threshold p_min(L, δ) such that the hierarchical certificate remains exponentially smaller whenever the teacher’s per-step prediction error is below 1 − p_min. We will also include a short paragraph noting that if local predictability degrades with duplication (e.g., due to context-dependent hardness), the separation may degrade to polynomial; this is already implicit in the deterministic-MDP model but will now be stated explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard bounds applied to explicit MDP model

full rationale

The paper models proof search as a deterministic finite-horizon MDP whose expert trajectories are verified teacher proofs, then invokes standard offline imitation-learning sample-complexity bounds whose inputs are average trajectory length, local action predictability, and student accuracy. The exponential separation is obtained by substituting the (by-construction) longer flat trajectories versus shorter hierarchical DAG representations into those external bounds; no equation in the paper reduces a fitted parameter or target quantity back to itself, and no load-bearing premise rests on a self-citation chain. The derivation is therefore self-contained against the stated modeling assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: modeling proof search as a deterministic finite-horizon MDP and restricting analysis to the biased distribution of successful teacher traces. No free parameters or invented entities are introduced.

axioms (2)

domain assumption Proof search is modeled as a deterministic finite-horizon MDP
Used to analyze offline imitation learning from proof traces.
domain assumption Training data is the biased distribution of successful verified proof traces from a teacher prover
Replaces worst-case analysis over all formulas.

pith-pipeline@v0.9.0 · 5508 in / 1229 out tokens · 67827 ms · 2026-05-16T02:24:56.159065+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We model proof search as a deterministic finite-horizon MDP... latent proof DAG Z... cut-elimination expands a DAG of depth D into a cut-free tree of size Ω(Λ^D)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

success probability lower bounds... sequential Rademacher/covering complexity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.