arxiv: 2604.12013 · v2 · submitted 2026-04-13 · 💻 cs.LG

Recognition: unknown

Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End

Steve Hanneke , Idan Mehalel , Shay Moran

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

classification 💻 cs.LG

keywords sample complexityautoregressive modelschain-of-thoughtPAC learningnext-token predictionlearning theorygeneration length

0 comments

The pith

Chain-of-thought supervision makes sample complexity independent of generation length T

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the number of training examples needed to learn an unknown next-token generator that is run for T steps to produce an output sequence. Under end-to-end supervision that reveals only the final token, the required sample size can grow with T at essentially any rate between constant and linear, subject to mild conditions on the generator class. Under chain-of-thought supervision that reveals the entire intermediate sequence, the sample complexity stays fixed regardless of T. This contrast shows that access to intermediate tokens can remove the cost of longer generations entirely.

Core claim

The authors establish a taxonomy for how sample complexity scales with T: end-to-end learning can realize any growth rate r(T) between constant and linear under mild conditions, while chain-of-thought learning yields sample complexity independent of T altogether. Combined with the prior linear upper bound, this gives a nearly complete characterization of the dependence on generation length and resolves several open questions about the role of intermediate supervision.

What carries the argument

New combinatorial tools that bound the sample complexity of the hypothesis classes induced by autoregressive next-token processes under full versus final-only supervision.

If this is right

End-to-end supervision permits sample complexity to scale in essentially any way with T between constant and linear.
Chain-of-thought supervision renders sample complexity completely independent of T.
The results supply a full characterization of learnability dependence on generation length.
New combinatorial tools are developed to prove the upper and lower bounds on these rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

For very long reasoning chains, full intermediate labels may be necessary to keep data requirements from growing.
The separation could guide the design of training data that includes partial chains when full supervision is expensive.
Empirical tests on specific generator families could check whether observed scaling matches the predicted arbitrary rates.

Load-bearing premise

The claim that end-to-end sample complexity can realize arbitrary rates between constant and linear rests on the existence of hypothesis classes satisfying mild conditions that allow constructions achieving each rate.

What would settle it

A concrete next-token generator class for which end-to-end sample complexity grows linearly with T while chain-of-thought sample complexity remains constant would confirm the separation; a class where both regimes exhibit the same rate would falsify the richness result.

Figures

Figures reproduced from arXiv: 2604.12013 by Idan Mehalel, Shay Moran, Steve Hanneke.

**Figure 2.** Figure 2: A geometric illustration of stable compression for maximum-margin linear classification. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: The tree TT (x) for T = 2, x = 0. 2. u and v lie on the same level in T1 if and only if ι(u) and ι(v) lie on the same level in T2. 5.4 Learning theory tools Some of our proofs rely on a broad collection of classic learning theory techniques developed through decades of research, including VC-theory, multiclass learning, sample compression schemes, and more. We define all relevant terms and concepts, and st… view at source ↗

read the original abstract

Modern large language models generate text autoregressively, producing tokens one at a time. To study the learnability of such systems, Joshi et al. (COLT 2025) introduced a PAC-learning framework for next-token generators, the primitive underlying autoregressive models. In this framework, an unknown next-token generator maps a sequence of tokens to the next token and is iteratively applied for $T$ steps, producing a chain of tokens whose final token constitutes the model's output. The learning task is to learn the input-output mapping induced by this autoregressive process. Depending on the available supervision, training examples may reveal only the final output (End-to-End supervision) or the entire generated chain (Chain-of-Thought supervision). This raises two natural questions: how the sample complexity depends on the generation length $T$, and how much Chain-of-Thought supervision can reduce this dependence. In this work we give a nearly complete answer to both questions by uncovering a taxonomy of how the sample complexity scales with $T$. For End-to-End learning, we show that the landscape is remarkably rich: subject to mild conditions, essentially any growth rate $r(T)$ between constant and linear can arise as the sample complexity, and combined with the linear upper bound of Joshi et al., this yields an essentially complete characterization. In contrast, under Chain-of-Thought supervision we show that the sample complexity is independent of $T$, demonstrating that access to intermediate reasoning steps can eliminate the dependence on the generation length altogether. Our analysis introduces new combinatorial tools, and as corollaries we resolve several open questions posed by Joshi et al. regarding the dependence of learnability on the generation length and the role of Chain-of-Thought supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

End-to-end autoregressive learning can realize almost any sample complexity scaling with T between constant and linear, while chain-of-thought supervision removes the T dependence completely.

read the letter

This paper maps out the sample complexity landscape for learning autoregressive generators under two supervision regimes. For end-to-end supervision it constructs cases where the complexity grows with T at essentially any rate in the constant-to-linear range, under mild conditions on the next-token class. For chain-of-thought it shows the complexity reduces to that of learning the base generator and therefore stays independent of T. These results directly settle several open questions left by Joshi et al. and introduce combinatorial arguments that handle the chain dependencies cleanly. The constructions are explicit enough to demonstrate the claimed richness, and the reduction for chain-of-thought is straightforward once the framework is set up. The work stays grounded in the existing PAC setup without adding circular steps or self-referential bounds. The mild conditions are stated up front and the stress-test confirms the taxonomy holds without internal contradictions or hidden T dependence. One minor limitation is that the arbitrary scaling relies on those conditions, so the result describes the possible range rather than what happens for every natural generator class. Readers working on learning theory for sequential models will get the most from it, especially anyone extending PAC analyses to autoregressive or reasoning settings. The paper is technically solid enough to warrant a full referee process rather than a desk rejection.

Referee Report

0 major / 2 minor

Summary. The paper studies the sample complexity of PAC-learning autoregressive next-token generators under End-to-End supervision (only final output observed) versus Chain-of-Thought supervision (full generation chain observed). It establishes that, subject to mild conditions on the generator class, End-to-End sample complexity can realize essentially any growth rate r(T) in [constant, linear] via explicit constructions, yielding a nearly complete characterization when paired with the linear upper bound from Joshi et al. (COLT 2025). In contrast, Chain-of-Thought reduces directly to PAC-learning the fixed base generator and produces T-independent bounds. New combinatorial tools are introduced, and several open questions from Joshi et al. on length dependence and the role of intermediate supervision are resolved as corollaries.

Significance. If the central taxonomy and constructions hold, the work provides a precise theoretical account of how supervision regime affects learnability in autoregressive models, showing that access to intermediate steps can eliminate all dependence on generation length T. The explicit constructions for arbitrary r(T) and the combinatorial arguments resolving prior open questions are notable strengths; the results are parameter-free in their scaling statements and falsifiable via the stated mild conditions.

minor comments (2)

[§2.2] §2.2, Definition 3: the precise statement of the 'mild conditions' on the next-token generator class is referenced but not restated in full; including the exact formulation here would make the constructions in §4 self-contained.
[Table 1] Table 1: the row for 'arbitrary r(T)' lists the End-to-End lower bound but omits the matching upper-bound reference to Joshi et al.; adding the citation would clarify the 'nearly complete characterization' claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the assessment of its significance, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained against external benchmarks

full rationale

The paper extends the external Joshi et al. (COLT 2025) PAC framework for next-token generators with new combinatorial arguments and explicit constructions. The End-to-End taxonomy (any r(T) in [const, linear] under mild conditions on the generator class) and the CoT T-independence result are derived directly from these tools and do not reduce to fitted parameters, self-definitions, or load-bearing self-citations. The linear upper bound is imported from the cited prior work (non-overlapping authors), and the new lower-bound constructions and corollaries resolving Joshi et al. open questions are independent of the target claims. No step equates a prediction to its input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claims rest on the PAC-learning framework for next-token generators introduced by Joshi et al. together with new combinatorial arguments; no free parameters, ad-hoc axioms, or invented entities are indicated in the abstract.

axioms (1)

domain assumption The PAC-learning framework for next-token generators and iterative application for T steps holds as defined by Joshi et al.
The entire analysis is conducted inside this recently introduced framework.

pith-pipeline@v0.9.0 · 5620 in / 1231 out tokens · 39885 ms · 2026-05-10T15:41:08.779579+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning
cs.LG 2026-05 unverdicted novelty 7.0

Online mistake bounds for autoregressive output learning can grow logarithmically with generation horizon M under end-to-end feedback but become independent of M with chain-of-thought trajectory access.

Reference graph

Works this paper leans on

12 extracted references · 11 canonical work pages · cited by 1 Pith paper

[1]

Cot information: Improved sam- ple complexity under chain-of-thought supervision.arXiv preprint arXiv:2505.15927,

[AML25] Awni Altabaa, Omar Montasser, and John Lafferty. Cot information: Improved sam- ple complexity under chain-of-thought supervision.arXiv preprint arXiv:2505.15927,

work page arXiv
[2]

On learning verifiers for chain-of-thought reasoning.arXiv preprint arXiv:2505.22650,

[BBLS25] Maria-Florina Balcan, Avrim Blum, Zhiyuan Li, and Dravyansh Sharma. On learning verifiers for chain-of-thought reasoning.arXiv preprint arXiv:2505.22650,

work page arXiv
[3]

arXiv preprint arXiv:2005.11818 , year=

COLT 2020 (9–12 July 2020). URL: https://proceedings.mlr.press/ v125/bousquet20a.html,arXiv:2005.11818. [DFSZ14] Thorsten Doliwa, Gaojian Fan, Hans Ulrich Simon, and Sandra Zilles. Recursive teaching dimension, vc-dimension and sample compression.The Journal of Machine Learning Research, 15(1):3107–3131,

work page arXiv 2020
[4]

Online learning of neural networks.arXiv preprint arXiv:2505.09167,

[DMM25] Amit Daniely, Idan Mehalel, and Elchanan Mossel. Online learning of neural networks.arXiv preprint arXiv:2505.09167,

work page arXiv
[5]

Mistake-bounded online learning with operation caps.arXiv preprint arXiv:2509.03892,

[GLT25] Jesse Geneson, Meien Li, and Linus Tang. Mistake-bounded online learning with operation caps.arXiv preprint arXiv:2509.03892,

work page arXiv
[6]

Transformers provably learn chain-of-thought reasoning with length generalization

[HWS+25] Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, and Yuxin Chen. Transformers provably learn chain-of-thought reasoning with length generalization.arXiv preprint arXiv:2511.07378,

work page arXiv
[7]

A theory of learning with autoregressive chain of thought.arXiv preprint arXiv:2503.07932,

[JVB+25] Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misi- akiewicz, and Nathan Srebro. A theory of learning with autoregressive chain of thought.arXiv preprint arXiv:2503.07932,

work page arXiv
[8]

Inherent limitations of dimensions for character- izing learnability of distribution classes

[LB24] Tosca Lechner and Shai Ben-David. Inherent limitations of dimensions for character- izing learnability of distribution classes. In Shipra Agrawal and Aaron Roth, editors, The Thirty Seventh Annual Conference on Learning Theory, June 30 - July 3, 2023, 41 Edmonton, Canada, Proceedings of Machine Learning Research, pages 3353–3374. PMLR,

2023
[9]

Auto-regressive next-token predictors are universal learners.arXiv preprint arXiv:2309.06979, 2023

[Mal23] Eran Malach. Auto-regressive next-token predictors are universal learners.arXiv preprint arXiv:2309.06979,

work page arXiv
[10]

From reasoning to super-intelligence: A search-theoretic perspective.arXiv preprint arXiv:2507.15865,

[SSS25] Shai Shalev-Shwartz and Amnon Shashua. From reasoning to super-intelligence: A search-theoretic perspective.arXiv preprint arXiv:2507.15865,

work page arXiv
[11]

How rein- forcement learning after next-token prediction facilitates learning.arXiv preprint arXiv:2510.11495, 2025

[TMUK25] Nikolaos Tsilivis, Eran Malach, Karen Ullrich, and Julia Kempe. How rein- forcement learning after next-token prediction facilitates learning.arXiv preprint arXiv:2510.11495,

work page arXiv
[12]

Tree dimension and the sauer-shelah dichotomy.arXiv preprint arXiv:2203.12211,

[Wal22] Roland Walker. Tree dimension and the sauer-shelah dichotomy.arXiv preprint arXiv:2203.12211,

work page arXiv