pith. machine review for the scientific record.

arxiv: 2605.06819 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords autoregressive learning, online mistake bounds, chain-of-thought feedback, end-to-end feedback, generation horizon, next-token generator, realizable online learning

The pith

In online autoregressive learning, end-to-end feedback allows mistake bounds to scale up to logarithmically with the generation horizon M, while full chain-of-thought feedback eliminates any dependence on M.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an online mistake-bound framework for the functions that arise when an unknown next-token generator is applied repeatedly for a fixed horizon of M steps. It contrasts the end-to-end model, where the learner receives only the final token after M steps, with the chain-of-thought model, where the learner also sees the entire intermediate trajectory. In the end-to-end case, the paper shows that the optimal mistake bound can realize essentially any growth rate between constant and logarithmic in M, and that the logarithmic ceiling is tight. Access to the full trajectory removes all dependence on M from the bound. The results therefore quantify how intermediate supervision changes the cost of learning long autoregressive maps.
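A minimal sketch of the object under study may help fix ideas: a next-token generator applied autoregressively for M steps, and the two feedback signals the two models reveal. The function names and tuple encoding here are illustrative, not the paper's notation.

```python
# Illustrative sketch (not the paper's code): a next-token generator g
# maps the current string (a tuple of tokens) to the next token; the
# induced map applies g autoregressively for M steps.

def autoregress(g, prompt, M):
    """Apply next-token generator g for M steps; return every prefix."""
    traj = [prompt]
    s = prompt
    for _ in range(M):
        s = s + (g(s),)  # append the next token
        traj.append(s)
    return traj

def end_to_end_feedback(g, prompt, M):
    """End-to-end model: only the final token after M steps is revealed."""
    return autoregress(g, prompt, M)[-1][-1]

def chain_of_thought_feedback(g, prompt, M):
    """Chain-of-thought model: the entire M-step trajectory is revealed."""
    return autoregress(g, prompt, M)
```

The contrast the paper studies is exactly how much less a learner can do with the first signal than with the second, as a function of M.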

Core claim

We prove a taxonomy of possible mistake-bound growth rates in the generation horizon M for the end-to-end model: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the chain-of-thought model, we show that access to the full generated trajectory eliminates the dependence on M altogether. We also analyze autoregressive linear threshold classes and prove optimal mistake bounds, as well as a new lower bound for the statistical setting.

What carries the argument

The online mistake bound for the final output of M repeated applications of an unknown next-token generator, under either final-token-only feedback or full-trajectory feedback.

If this is right

  • Any growth rate between constant and logarithmic in M is achievable for some autoregressive classes under end-to-end feedback.
  • Logarithmic growth in M is the worst-case rate that can be forced in the end-to-end model.
  • Full trajectory observation makes the optimal mistake bound independent of M.
  • Optimal mistake bounds are established for autoregressive linear threshold classes.
  • Several open questions from the prior PAC analysis of autoregressive maps are resolved.
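The third bullet has a classical intuition behind it for finite classes: with full-trajectory feedback, the halving argument bounds mistakes by log2 of the class size, with no dependence on M. The sketch below is our hedged illustration of that argument, not the paper's construction; the names `final_token` and `halving_cot` are ours.

```python
# Hedged illustration of why chain-of-thought feedback removes the
# M-dependence for a finite candidate class: the halving argument.

def final_token(g, prompt, M):
    """Final token produced by applying generator g for M steps."""
    s = prompt
    for _ in range(M):
        s = s + (g(s),)
    return s[-1]

def halving_cot(candidates, truth, prompts, M):
    """Majority-vote learner with chain-of-thought feedback.

    Realizability: `truth` is in `candidates`. When the majority vote
    is wrong, more than half of the version space mispredicts the
    final token; every such candidate disagrees with the revealed
    trajectory and is eliminated, so mistakes <= log2(len(candidates)),
    independent of M.
    """
    version_space = list(candidates)
    mistakes = 0
    for prompt in prompts:
        # True trajectory, revealed as chain-of-thought feedback.
        traj = [prompt]
        s = prompt
        for _ in range(M):
            s = s + (truth(s),)
            traj.append(s)
        votes = [final_token(g, prompt, M) for g in version_space]
        prediction = max(set(votes), key=votes.count)
        if prediction != traj[-1][-1]:
            mistakes += 1
        # Keep only candidates consistent with every revealed step.
        version_space = [g for g in version_space
                         if all(g(traj[i]) == traj[i + 1][-1]
                                for i in range(M))]
    return mistakes
```

Under end-to-end feedback this elimination step is unavailable for intermediate tokens, which is where the M-dependence in the taxonomy comes from.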

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training with full reasoning traces could remove the sample-complexity penalty associated with long generation horizons.
  • The taxonomy suggests that the scaling of online regret with sequence length can be controlled by the choice of function class.
  • The same qualitative distinction between end-to-end and trajectory feedback may appear in non-realizable or noisy online settings.
  • Empirical tests on concrete next-token classes could check whether real language-model families attain the logarithmic worst case.

Load-bearing premise

The target function is realized exactly by repeated application of some fixed but unknown next-token generator for a fixed horizon M, in a realizable online setting with feedback after each prediction.
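The premise can be rendered as a short protocol loop, with illustrative names of our choosing (the paper's formal definitions may differ): the learner commits to a final-token prediction, the fixed unknown generator produces the truth, and feedback is either the final token or the full trajectory.

```python
# Minimal rendering of the assumed realizable online protocol.
# `learner` is any object with predict(x) and update(x, feedback).

def online_protocol(learner, truth, prompts, M, chain_of_thought=False):
    """Run the online game; return the number of final-token mistakes."""
    mistakes = 0
    for x in prompts:
        y_hat = learner.predict(x)  # learner commits before feedback
        traj = [x]
        s = x
        for _ in range(M):
            s = s + (truth(s),)  # fixed, unknown next-token generator
            traj.append(s)
        y = s[-1]
        mistakes += int(y_hat != y)
        # Feedback: the final token only, or the full M-step trajectory.
        learner.update(x, traj if chain_of_thought else y)
    return mistakes
```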

What would settle it

A concrete family of next-token generators for which the minimal end-to-end mistake bound grows faster than any logarithmic function of M, or a chain-of-thought setting in which the bound still scales with M.

Figures

Figures reproduced from arXiv: 2605.06819 by Elchanan Mossel, Idan Mehalel, Ilan Doron-Arad.

Figure 1. The generation tree T_M(x^(j)), visualized for the hard class F_{d,M} with M = 4 and j = 1 (that is, x^(j) = 0). Note that there are M − 1 red nodes, where the version space of each typically contains an O(1)/M fraction of the class, and thus in total an (M − 1) · O(1)/M = O(1) fraction of the class is contained in the union of red vertices. The edges where the size of the class drops by a factor of O(1)/M are b… view at source ↗
Figure 2. The tree T_M(x) for M = 2, x = 0. For every internal node v, the left outgoing edge is labeled with 0 and the right outgoing edge with 1. The set of vertices and edges of T are denoted v(T) and e(T), respectively. Every path in the tree that starts at the root is called a prefix. A prefix that ends at a leaf is called a branch. The set of all branches in T is denoted B(T). A prefix can be identified … view at source ↗
read the original abstract

Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for $M$ steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after $M$ autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire $M$-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon $M$, and to what extent observing intermediate tokens can reduce this dependence. Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [Hanneke et al., 2026], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon $M$: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on $M$ altogether. We also analyze autoregressive linear threshold classes, and prove optimal mistake bounds, as well as a new lower bound for the statistical setting. Along the way, our results resolve several questions left open by [Joshi et al., 2025].

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript develops an online learning framework for autoregressive generation, viewing it as repeated application of an unknown next-token generator over a fixed horizon M. It distinguishes the End-to-End model (feedback only on the final token) from the Chain-of-Thought model (full trajectory feedback). Main results establish a taxonomy of possible mistake-bound growth rates in M for End-to-End (any rate between constant and logarithmic), prove the logarithmic ceiling is tight, and show that Chain-of-Thought feedback removes all M-dependence. Additional contributions include optimal mistake bounds for autoregressive linear threshold classes, a new lower bound in the statistical setting, and resolution of open questions from Joshi et al. (2025). The analysis assumes a realizable setting with online feedback after each prediction.

Significance. If the derivations hold, the work provides a valuable theoretical foundation for online learnability of autoregressive models, extending prior PAC results with a different scale of M-dependence. The taxonomy and tightness results, combined with the demonstration that full trajectory access eliminates M-dependence, offer clear insights into feedback mechanisms relevant to LLM training. Credit is due for the parameter-free nature of the bounds under standard realizable assumptions and for resolving multiple open questions from the 2025 reference. This could inform algorithm design for chain-of-thought prompting and online adaptation.

minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction would benefit from an explicit enumeration of the open questions from Joshi et al. (2025) that are resolved, to make the contribution more immediately verifiable.
  2. [Section 2 (Preliminaries)] Notation for the next-token generator and the induced target function is introduced clearly in the abstract but should be cross-referenced with a formal definition in the first technical section for consistency.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation for minor revision. We appreciate the accurate summary of our contributions, including the taxonomy of mistake-bound growth rates in the End-to-End model, the tightness of the logarithmic ceiling, the elimination of M-dependence under Chain-of-Thought feedback, the optimal bounds for linear threshold classes, the new statistical lower bound, and the resolution of open questions from Joshi et al. (2025).

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The derivation establishes mistake bounds for autoregressive online learning under standard realizable assumptions with fixed unknown next-token generator and horizon M. End-to-End results prove a taxonomy of growth rates (constant to logarithmic in M) with a tight log ceiling via direct analysis of the feedback model; Chain-of-Thought results show M-independence from full trajectory access. These follow from applying online learning theory to the autoregressive process without reducing to fitted parameters, self-definitions, or load-bearing self-citations. The linear threshold analysis and resolutions of prior open questions are likewise independent of the target claims. The framework is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract; the framework rests on standard domain assumptions from online learning theory. No free parameters, invented entities, or ad-hoc axioms are mentioned.

axioms (2)
  • domain assumption The target concept is induced by M repeated applications of a fixed unknown next-token generator.
    This defines the input-output map the learner must acquire, as stated in the abstract.
  • domain assumption Feedback is received after each online prediction, either end-to-end or full trajectory.
    Core distinction of the two models under study.

pith-pipeline@v0.9.0 · 5649 in / 1535 out tokens · 110972 ms · 2026-05-11T00:47:38.918308+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 16 canonical work pages · 7 internal anchors

  1. [1]

    Structured-seed local pseudorandom generators and their applications

    [ABCM25] Benny Applebaum, Dung Bui, Geoffroy Couteau, and Nikolas Melissaris. Structured-seed local pseudorandom generators and their applications. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2025), volume 353, pages 63:1, Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2025.

  2. [2]

    CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision

    [AML25] Awni Altabaa, Omar Montasser, and John Lafferty. CoT information: Improved sample complexity under chain-of-thought supervision. arXiv preprint arXiv:2505.15927, 2025.

  3. [3]

    [BMR+20] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateu...

  4. [4]

    Deterministic Apple Tasting

    [CM24] Zachary Chase and Idan Mehalel. Deterministic apple tasting. arXiv preprint arXiv:2410.10404, 2024.

  5. [5]

    [DGY+25] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan...

  6. [6]

    URL: https://doi.org/10.48550/arXiv.2501.12948, arXiv:2501.12948, doi:10.48550/ARXIV.2501.12948. [DJP+24] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hins...

  7. [7]

    The Llama 3 Herd of Models

    URL: https://doi.org/10.48550/arXiv.2407.21783, arXiv:2407.21783, doi:10.48550/ARXIV.2407.21783. [DSBDSS15] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. J. Mach. Learn. Res., 16(1):2377–2404, 2015.

  8. [8]

    Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End

    Private communication with the authors. [HMM26] Steve Hanneke, Idan Mehalel, and Shay Moran. Sample complexity of autoregressive reasoning: Chain-of-thought vs. end-to-end. arXiv preprint arXiv:2604.12013, 2026.

  9. [9]

    Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

    arXiv:2604.12013, doi:10.48550/arXiv.2604.12013. [HWS+25] Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, and Yuxin Chen. Transformers provably learn chain-of-thought reasoning with length generalization. arXiv preprint arXiv:2511.07378, 2025.

  10. [10]

    OpenAI o1 System Card

    [JKL+24] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...

  11. [11]

    A Theory of Learning with Autoregressive Chain of Thought

    [JVB+25] Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, and Nathan Srebro. A theory of learning with autoregressive chain of thought. arXiv preprint arXiv:2503.07932, 2025.

  12. [12]

    Transformers Provably Solve Parity Efficiently with Chain of Thought

    [KS24] Juno Kim and Taiji Suzuki. Transformers provably solve parity efficiently with chain of thought. arXiv preprint arXiv:2410.08633, 2024.

  13. [13]

    From on-line to batch learning

    [Lit89] Nick Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory (COLT 1989), pages 269–284, San Francisco, CA, USA, 1989.

  14. [14]

    Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

    Morgan Kaufmann Publishers Inc. URL: https://dl.acm.org/doi/10.5555/93335.93365, doi:10.5555/93335.93365. [LLZM24] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations, 2024.

  15. [15]

    Auto-Regressive Next-Token Predictors Are Universal Learners

    [Mal23] Eran Malach. Auto-regressive next-token predictors are universal learners. arXiv preprint arXiv:2309.06979, 2023.

  16. [16]

    The Expressive Power of Transformers with Chain of Thought

    [MS23] William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023.

  17. [18]

    Show Your Work: Scratchpads for Intermediate Computation with Language Models

    URL: https://arxiv.org/abs/2112.00114, arXiv:2112.00114. [RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019.

  18. [19]

    The Probably Approximately Correct Learning Model in Computational Learning Theory

    [Ser25] Rocco A. Servedio. The probably approximately correct learning model in computational learning theory. arXiv preprint arXiv:2511.08791, 2025.

  19. [20]

    How Reinforcement Learning after Next-Token Prediction Facilitates Learning

    [TMUK25] Nikolaos Tsilivis, Eran Malach, Karen Ullrich, and Julia Kempe. How reinforcement learning after next-token prediction facilitates learning. arXiv preprint arXiv:2510.11495, 2025.

  20. [21]

    Attention Is All You Need

    [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Co...

  21. [22]

    Finetuned Language Models Are Zero-Shot Learners

    [WBZ+22] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.