Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

Avrim Blum; Dravyansh Sharma; Kiriaki Fragkia; Maria-Florina Balcan; Zhiyuan Li

arxiv: 2603.03538 · v3 · pith:5CYHBLTGnew · submitted 2026-03-03 · 💻 cs.LG

Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

Maria-Florina Balcan , Avrim Blum , Kiriaki Fragkia , Zhiyuan Li , Dravyansh Sharma This is my paper

Pith reviewed 2026-05-21 11:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords chain-of-thoughtverifier learningonline learningLittlestone dimensionsoundness errorscompleteness errorsgeneratorsreasoning

0 comments

The pith

Chain-of-thought verifiers are online learnable with tight mistake bounds from extensions of the Littlestone dimension that separate soundness and completeness errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up an online learning problem for verifiers that judge sequences of reasoning steps produced by language models. The central difficulty is that the verifier's own feedback changes the distribution of reasoning traces the generator produces next. To handle this, the authors define two kinds of mistakes with different costs: soundness mistakes that let incorrect steps pass, and completeness mistakes that wrongly reject correct steps. They introduce new dimension measures that extend the classic Littlestone dimension to these asymmetric costs and prove that these measures exactly determine the worst-case number of mistakes any online learner must suffer. The same measures yield algorithms that trace out the best possible trade-off curves and that convert a collection of weak generators into one that produces correct long reasoning traces with low error.

Core claim

The central claim is that novel extensions of the Littlestone dimension, which separately count soundness and completeness mistakes, fully characterize the online learnability of chain-of-thought verifiers in the realizable case and directly yield optimal algorithms for the Pareto frontier of total mistakes versus soundness budget as well as for any linear combination of the two costs.

What carries the argument

Asymmetric extensions of the Littlestone dimension that track soundness versus completeness mistakes separately for online verifier learning.

If this is right

Optimal algorithms recover the smallest total mistakes achievable for any fixed budget of soundness mistakes.
The same algorithms minimize any chosen linear combination of soundness and completeness costs.
The learned verifier can be inserted into the generation loop to raise accuracy of weak generators and produce reasoning traces longer or more complex than those seen in initial training.
Under the mild probability assumption, the combined system yields a strong generator whose final outputs have both low error rate and low abstention rate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same asymmetric online-learning machinery could be applied to other iterative generation settings where feedback creates distribution shift, such as code synthesis or multi-step planning.
The Pareto-frontier construction offers a concrete way to tune verifiers toward safety-critical priorities by capping soundness mistakes while accepting higher completeness cost.
Empirical checks on existing reasoning benchmarks could test whether the dimension-based bounds remain predictive once large language models replace the abstract weak generators.

Load-bearing premise

At least one of the available generators produces the next correct reasoning step with some positive minimum probability.

What would settle it

Run the proposed online learner on a sequence of reasoning traces where the observed number of mistakes exceeds the bound predicted by the new dimension, or where the verifier fails to reduce generator error below the level guaranteed under the minimal-probability assumption.

read the original abstract

Large Language Models (LLMs) with chain-of-thought generation have demonstrated great potential for solving complex reasoning and planning tasks. However, the output of current LLMs is not fully reliable and needs careful verification. Even if LLMs get more accurate over time, learned verifiers can help increase trust, enforce safety constraints, and ensure alignment with personal preferences. A major challenge in learning verifiers, however, especially when their output will be used by the generator to improve its reasoning, is that the feedback loop between generator and verifier may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric role of soundness errors (failure in catching errors in a reasoning trace) and completeness errors (flagging correct reasoning steps as wrong), we introduce novel extensions of the Littlestone dimension which tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto-frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as for minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak generators, and enable generation of proofs beyond what they were initially trained on. With the mild assumption that one of the generators can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong generator with small error and abstention rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives tight online-learning bounds via asymmetric Littlestone extensions for soundness/completeness trade-offs and then applies the verifier to boost generators under one mild assumption.

read the letter

The main point is that this work adapts online learning to chain-of-thought verifiers by treating soundness errors (letting a bad step through) and completeness errors (rejecting a good step) asymmetrically. They define new dimension notions that extend the classic Littlestone dimension to this setting and prove matching upper and lower bounds on mistakes in the realizable case. They also supply algorithms that compute the Pareto frontier for a fixed soundness budget and that minimize a linear combination of the two error types. That part of the paper is the cleanest contribution and looks like a substantive technical step rather than a routine application. The online model directly handles the distribution shift that arises when the verifier's feedback changes what the generator produces next, which is a realistic modeling choice. The boosting section then shows that a learned verifier can turn a collection of weak generators into a stronger one that produces longer correct traces, provided at least one generator has a fixed positive probability of emitting the right next step. The assumption is stated plainly and the argument stays within the realizable online framework, so there is no obvious circularity. The formal results appear to rest on standard online-learning machinery plus the new dimension definitions, and the citation pattern tracks the relevant prior work on Littlestone dimension and recent verifier papers. One softer spot is the quantitative behavior of the final error and abstention rates once the mild probability assumption is plugged in. The abstract claims small rates, but if the analysis only establishes finite bounds without an explicit dependence on that probability or on the speed of the online updates, the practical improvement could shrink under modest values or under continued shift. The paper does not appear to contain empirical checks, so the claims stay purely theoretical. This is written for researchers who work on the foundations of reliable reasoning systems or who want to import online-learning tools into LLM settings. A reader who already knows Littlestone dimension and cares about asymmetric costs will get the most out of the dimension characterizations and algorithms. The work is grounded enough and the problem timely enough that it deserves a serious referee, even if the boosting rates need extra checking during review. I would send it out for peer review.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes an online learning framework for chain-of-thought verifiers that addresses distribution shift arising from the generator-verifier feedback loop. It introduces extensions of the Littlestone dimension to tightly characterize mistake bounds for verifiers in the realizable setting, distinguishing soundness errors (missing incorrect steps) from completeness errors (flagging correct steps). Optimal algorithms are provided for the Pareto frontier of total mistakes subject to a soundness-mistake budget and for minimizing linear combinations of asymmetric costs. The learned verifiers are further shown to boost a collection of weak generators, yielding a strong generator with small error and abstention rates under the mild assumption that at least one generator produces the next correct reasoning step with probability p > 0.

Significance. If the characterizations and algorithms are correct, the work supplies a theoretical basis for learning verifiers that remain effective under online distribution shift, with direct relevance to reliable LLM reasoning, safety constraints, and self-improving systems. The asymmetric Littlestone extensions and explicit Pareto algorithms constitute a clear theoretical advance; the boosting result, if made quantitatively rigorous, would add substantial practical value.

major comments (1)

[§5.2] §5.2 (generator boosting): The argument that verifier mistake bounds imply small generator error and abstention rates under the assumption that one generator succeeds with probability p > 0 does not supply an explicit dependence of the final rates on 1/p or on the relative speed of the online verifier updates. Without this, it is unclear whether the rates remain small for modest p once the verifier-induced distribution shift is taken into account.

minor comments (2)

[§2] The realizable-setting assumption is invoked repeatedly; a single consolidated statement of what is assumed about the data-generating process would improve readability.
[Figure 2] Figure 2 (Pareto curves) would benefit from error bars or a statement of the number of independent runs used to generate the plotted points.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We are grateful to the referee for their thorough review and valuable feedback on our manuscript. The major comment raises an important point about the quantitative aspects of the generator boosting result, which we address below. We will revise the manuscript to provide the requested explicit dependencies.

read point-by-point responses

Referee: [§5.2] §5.2 (generator boosting): The argument that verifier mistake bounds imply small generator error and abstention rates under the assumption that one generator succeeds with probability p > 0 does not supply an explicit dependence of the final rates on 1/p or on the relative speed of the online verifier updates. Without this, it is unclear whether the rates remain small for modest p once the verifier-induced distribution shift is taken into account.

Authors: We thank the referee for this insightful comment. Upon re-examination, the current version of §5.2 presents the boosting result at a high level, showing that the verifier's online mistake bounds can be used to ensure that the effective distribution remains favorable for the strong generator, leading to small error and abstention rates. However, we acknowledge that an explicit functional dependence on 1/p and the update speed is not derived in the text. The proof relies on the fact that the number of verifier mistakes is bounded by the (asymmetric) Littlestone dimension, and each mistake affects the generator's effective success probability by a factor related to p. To address this, we will revise the section to include a new lemma that derives the explicit bound: the generator error rate is at most O(M / p + ε), where M is the verifier mistake bound and ε accounts for the online convergence rate. This will make clear that for p fixed and M small relative to p, the rates remain small even under distribution shift. We believe this addition will clarify the result without altering the main claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces novel extensions of the Littlestone dimension to characterize asymmetric soundness and completeness mistakes for online verifier learning in the realizable setting, then derives optimal algorithms for Pareto frontiers and linear cost minimization. These steps rest on standard online learning theory with explicit new definitions and bounds rather than self-referential inputs. The generator-boosting result is conditioned on a stated mild external assumption (one generator produces correct steps with minimal probability p > 0) and does not reduce to fitted parameters, self-citations, or ansatzes; the final error/abstention claims follow from the verifier mistake bounds under that assumption without circular reduction. No load-bearing self-citation chains or renamings of known results are present.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the realizable setting (existence of a perfect verifier) and the mild generator probability assumption; no free parameters or invented entities are mentioned.

axioms (2)

domain assumption Realizable setting: there exists a perfect verifier consistent with the data.
Invoked to obtain tight mistake bounds via the extended Littlestone dimension.
domain assumption One generator produces the next correct step with positive minimal probability.
Required for the boosting result that converts weak generators into a strong one with low error.

pith-pipeline@v0.9.0 · 5839 in / 1413 out tokens · 51072 ms · 2026-05-21T11:21:01.646539+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce novel extensions of the Littlestone dimension which tightly characterize the mistake bounds for learning a verifier in the realizable setting... SC-Ldim(H, k) := sup{m : there exists a (k,m)-difficult mistake tree}
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

With the mild assumption that one of the generators can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong generator with small error and abstention rates.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.