Recognition: unknown
Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End
Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3
The pith
Chain-of-thought supervision makes sample complexity independent of generation length T
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish a taxonomy for how sample complexity scales with T: end-to-end learning can realize any growth rate r(T) between constant and linear under mild conditions, while chain-of-thought learning yields sample complexity independent of T altogether. Combined with the prior linear upper bound, this gives a nearly complete characterization of the dependence on generation length and resolves several open questions about the role of intermediate supervision.
What carries the argument
New combinatorial tools that bound the sample complexity of the hypothesis classes induced by autoregressive next-token processes under full versus final-only supervision.
If this is right
- End-to-end supervision permits sample complexity to scale in essentially any way with T between constant and linear.
- Chain-of-thought supervision renders sample complexity completely independent of T.
- The results supply a full characterization of learnability dependence on generation length.
- New combinatorial tools are developed to prove the upper and lower bounds on these rates.
Where Pith is reading between the lines
- For very long reasoning chains, full intermediate labels may be necessary to keep data requirements from growing.
- The separation could guide the design of training data that includes partial chains when full supervision is expensive.
- Empirical tests on specific generator families could check whether observed scaling matches the predicted arbitrary rates.
Load-bearing premise
The claim that end-to-end sample complexity can realize arbitrary rates between constant and linear rests on the existence of hypothesis classes satisfying mild conditions that allow constructions achieving each rate.
What would settle it
A concrete next-token generator class for which end-to-end sample complexity grows linearly with T while chain-of-thought sample complexity remains constant would confirm the separation; a class where both regimes exhibit the same rate would falsify the richness result.
Figures
read the original abstract
Modern large language models generate text autoregressively, producing tokens one at a time. To study the learnability of such systems, Joshi et al. (COLT 2025) introduced a PAC-learning framework for next-token generators, the primitive underlying autoregressive models. In this framework, an unknown next-token generator maps a sequence of tokens to the next token and is iteratively applied for $T$ steps, producing a chain of tokens whose final token constitutes the model's output. The learning task is to learn the input-output mapping induced by this autoregressive process. Depending on the available supervision, training examples may reveal only the final output (End-to-End supervision) or the entire generated chain (Chain-of-Thought supervision). This raises two natural questions: how the sample complexity depends on the generation length $T$, and how much Chain-of-Thought supervision can reduce this dependence. In this work we give a nearly complete answer to both questions by uncovering a taxonomy of how the sample complexity scales with $T$. For End-to-End learning, we show that the landscape is remarkably rich: subject to mild conditions, essentially any growth rate $r(T)$ between constant and linear can arise as the sample complexity, and combined with the linear upper bound of Joshi et al., this yields an essentially complete characterization. In contrast, under Chain-of-Thought supervision we show that the sample complexity is independent of $T$, demonstrating that access to intermediate reasoning steps can eliminate the dependence on the generation length altogether. Our analysis introduces new combinatorial tools, and as corollaries we resolve several open questions posed by Joshi et al. regarding the dependence of learnability on the generation length and the role of Chain-of-Thought supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the sample complexity of PAC-learning autoregressive next-token generators under End-to-End supervision (only final output observed) versus Chain-of-Thought supervision (full generation chain observed). It establishes that, subject to mild conditions on the generator class, End-to-End sample complexity can realize essentially any growth rate r(T) in [constant, linear] via explicit constructions, yielding a nearly complete characterization when paired with the linear upper bound from Joshi et al. (COLT 2025). In contrast, Chain-of-Thought reduces directly to PAC-learning the fixed base generator and produces T-independent bounds. New combinatorial tools are introduced, and several open questions from Joshi et al. on length dependence and the role of intermediate supervision are resolved as corollaries.
Significance. If the central taxonomy and constructions hold, the work provides a precise theoretical account of how supervision regime affects learnability in autoregressive models, showing that access to intermediate steps can eliminate all dependence on generation length T. The explicit constructions for arbitrary r(T) and the combinatorial arguments resolving prior open questions are notable strengths; the results are parameter-free in their scaling statements and falsifiable via the stated mild conditions.
minor comments (2)
- [§2.2] §2.2, Definition 3: the precise statement of the 'mild conditions' on the next-token generator class is referenced but not restated in full; including the exact formulation here would make the constructions in §4 self-contained.
- [Table 1] Table 1: the row for 'arbitrary r(T)' lists the End-to-End lower bound but omits the matching upper-bound reference to Joshi et al.; adding the citation would clarify the 'nearly complete characterization' claim.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, the assessment of its significance, and the recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity; derivation is self-contained against external benchmarks
full rationale
The paper extends the external Joshi et al. (COLT 2025) PAC framework for next-token generators with new combinatorial arguments and explicit constructions. The End-to-End taxonomy (any r(T) in [const, linear] under mild conditions on the generator class) and the CoT T-independence result are derived directly from these tools and do not reduce to fitted parameters, self-definitions, or load-bearing self-citations. The linear upper bound is imported from the cited prior work (non-overlapping authors), and the new lower-bound constructions and corollaries resolving Joshi et al. open questions are independent of the target claims. No step equates a prediction to its input by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The PAC-learning framework for next-token generators and iterative application for T steps holds as defined by Joshi et al.
Forward citations
Cited by 1 Pith paper
-
A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning
Online mistake bounds for autoregressive output learning can grow logarithmically with generation horizon M under end-to-end feedback but become independent of M with chain-of-thought trajectory access.
Reference graph
Works this paper leans on
-
[1]
[AML25] Awni Altabaa, Omar Montasser, and John Lafferty. Cot information: Improved sam- ple complexity under chain-of-thought supervision.arXiv preprint arXiv:2505.15927,
-
[2]
On learning verifiers for chain-of-thought reasoning.arXiv preprint arXiv:2505.22650,
[BBLS25] Maria-Florina Balcan, Avrim Blum, Zhiyuan Li, and Dravyansh Sharma. On learning verifiers for chain-of-thought reasoning.arXiv preprint arXiv:2505.22650,
-
[3]
arXiv preprint arXiv:2005.11818 , year=
COLT 2020 (9–12 July 2020). URL: https://proceedings.mlr.press/ v125/bousquet20a.html,arXiv:2005.11818. [DFSZ14] Thorsten Doliwa, Gaojian Fan, Hans Ulrich Simon, and Sandra Zilles. Recursive teaching dimension, vc-dimension and sample compression.The Journal of Machine Learning Research, 15(1):3107–3131,
-
[4]
Online learning of neural networks.arXiv preprint arXiv:2505.09167,
[DMM25] Amit Daniely, Idan Mehalel, and Elchanan Mossel. Online learning of neural networks.arXiv preprint arXiv:2505.09167,
-
[5]
Mistake-bounded online learning with operation caps.arXiv preprint arXiv:2509.03892,
[GLT25] Jesse Geneson, Meien Li, and Linus Tang. Mistake-bounded online learning with operation caps.arXiv preprint arXiv:2509.03892,
-
[6]
Transformers provably learn chain-of-thought reasoning with length generalization
[HWS+25] Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, and Yuxin Chen. Transformers provably learn chain-of-thought reasoning with length generalization.arXiv preprint arXiv:2511.07378,
-
[7]
A theory of learning with autoregressive chain of thought.arXiv preprint arXiv:2503.07932,
[JVB+25] Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misi- akiewicz, and Nathan Srebro. A theory of learning with autoregressive chain of thought.arXiv preprint arXiv:2503.07932,
-
[8]
Inherent limitations of dimensions for character- izing learnability of distribution classes
[LB24] Tosca Lechner and Shai Ben-David. Inherent limitations of dimensions for character- izing learnability of distribution classes. In Shipra Agrawal and Aaron Roth, editors, The Thirty Seventh Annual Conference on Learning Theory, June 30 - July 3, 2023, 41 Edmonton, Canada, Proceedings of Machine Learning Research, pages 3353–3374. PMLR,
2023
-
[9]
Auto-regressive next-token predictors are universal learners.arXiv preprint arXiv:2309.06979, 2023
[Mal23] Eran Malach. Auto-regressive next-token predictors are universal learners.arXiv preprint arXiv:2309.06979,
-
[10]
[SSS25] Shai Shalev-Shwartz and Amnon Shashua. From reasoning to super-intelligence: A search-theoretic perspective.arXiv preprint arXiv:2507.15865,
-
[11]
[TMUK25] Nikolaos Tsilivis, Eran Malach, Karen Ullrich, and Julia Kempe. How rein- forcement learning after next-token prediction facilitates learning.arXiv preprint arXiv:2510.11495,
-
[12]
Tree dimension and the sauer-shelah dichotomy.arXiv preprint arXiv:2203.12211,
[Wal22] Roland Walker. Tree dimension and the sauer-shelah dichotomy.arXiv preprint arXiv:2203.12211,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.