arxiv: 2605.12316 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

Yunbei Xu , Yuzhe Yuan , Ruohan Zhan

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords autoregressive learningjoint KL divergenceoracle boundsmodel misspecificationlong-horizon errorsequence predictionimitation learningestimation lower bounds

0 comments

The pith

Joint KL divergence allows horizon-free approximation error in autoregressive learning while forcing estimation error to scale linearly with sequence length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies autoregressive sequence learning under model misspecification when performance is measured by the joint Kullback-Leibler divergence between the true and learned sequence distributions. It shows that this choice of divergence removes any dependence on horizon length from the approximation error term, in contrast to Hellinger-based approaches that suffer linear growth. At the same time, an information-theoretic lower bound establishes that estimation error must grow at least linearly with the horizon, and this bound applies whether policies are fully shared or decomposable. Matching upper bounds of order O~(H) are achieved by computationally efficient algorithms, yielding the first complete sharp characterization of long-horizon behavior.

Core claim

By establishing matching upper and lower bounds on the joint KL divergence, the work shows that autoregressive models under misspecification admit a horizon-free approximation factor while the estimation error carries a fundamental Omega(H) lower bound that holds for both decomposable and fully shared policy classes; these bounds align the log-loss training objective with sequence-level evaluation and imply regret bounds for policy learning that match prior imitation-learning results.

What carries the argument

Matching oracle upper and lower bounds on the joint KL divergence that isolate the divergence choice as the source of horizon dependence.

If this is right

Joint KL admits horizon-free approximation, unlike Hellinger-based analyses that require Omega(H) dependence for efficient methods.
Estimation error lower bound of Omega(H) holds for both decomposable policy classes and fully shared policies.
The O~(H) upper bounds achieved by efficient algorithms match the lower bound, completing the characterization.
These joint-KL guarantees imply policy learning regret bounds at rates matching prior imitation learning literature.
The results align the log-loss training objective, sequence-level evaluation metric, and approximation metric.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation between horizon-free approximation and linear estimation cost suggests that divergence choice alone can prevent error blow-up in long sequences without changing the model class.
The Omega(H) estimation lower bound may limit practical scaling for extremely long horizons even with optimal algorithms, pointing to possible benefits from hybrid objectives or structured misspecification assumptions.
The alignment of training loss and evaluation metric under joint KL could extend to other sequential decision problems where log-loss is used but evaluation uses a different divergence.

Load-bearing premise

The lower bound and characterization assume that joint KL is the chosen evaluation metric and that the relevant information-theoretic quantities remain well-defined under misspecification.

What would settle it

A concrete finite-sample calculation or simulation on a simple misspecified autoregressive model showing that the joint KL estimation error either stays sublinear in H or exceeds the Omega(H) rate would disprove the claimed lower bound.

read the original abstract

We study the fundamental and timely problem of learning long sequences in autoregressive modeling and next-token prediction under model misspecification, measured by the joint Kullback--Leibler (KL) divergence. Our goal is to characterize how the sequence horizon \(H\) affects both approximation and estimation errors in this joint-distribution, sequence-level regime. By establishing matching upper and lower bounds, we provide, to our knowledge, the first complete characterization of long-horizon error behavior under the natural joint KL objective, with improved rates and optimality justification relative to existing work. On the approximation side, we show that joint KL admits a horizon-free approximation factor, in sharp contrast to Hellinger-based analyses that exhibit an \(\Omega(H)\) dependence for computationally efficient methods; this isolates the choice of divergence as the source of approximation amplification. On the estimation side, we prove a fundamental information-theoretic lower bound of order \(\Omega(H)\) that holds for both decomposable policy classes and fully shared policies, matching the \(\widetilde O(H)\) upper bounds achieved by computationally efficient algorithms. Our analysis clarifies the landscape of recent autoregressive learning results by aligning the log-loss training objective, the sequence-level evaluation metric, and the approximation metric {\color{black}through a sharp joint-KL oracle theory}. We further show that these joint-KL guarantees imply policy learning regret bounds at rates matching prior imitation learning literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper delivers the first matching upper and lower bounds for long-horizon autoregressive error under joint KL, with horizon-free approximation but linear estimation cost.

read the letter

The main takeaway is that this work gives the first complete characterization of long-horizon error in autoregressive models when everything is measured by joint KL. Approximation error stays independent of the horizon H, while estimation error scales linearly with H, and the bounds match up for both decomposable and fully shared policy classes. That contrast with prior Hellinger work is useful because it pins the approximation blowup on the choice of divergence rather than on the autoregressive structure itself. The upper bounds follow from standard efficient algorithms, and the lower bound is information-theoretic. They also show these guarantees carry over to policy regret bounds that line up with existing imitation learning results. The analysis aligns the log-loss objective, the sequence-level metric, and the approximation measure without obvious circularity or extra assumptions beyond standard oracle inequalities. One spot worth checking in the full proofs is the lower bound construction for fully shared policies under misspecification. If the hard instance relies on per-timestep misspecification that cannot be realized when parameters are tied across the horizon, the Omega(H) rate might not hold and the matching would break. The abstract states the bound applies anyway, so the construction presumably handles the coupling, but the explicit distribution would clarify whether the degrees of freedom really force linear cost. This is for researchers working on theoretical foundations of sequence modeling and imitation learning. Anyone who needs sharp rates or wants to understand why joint KL behaves better than other divergences for long horizons will find it directly useful. I would send it to peer review. The claims are focused, the rates are new, and the divergence comparison clarifies an open question even if the shared-policy lower bound needs careful verification.

Referee Report

2 major / 2 minor

Summary. The paper studies autoregressive learning and next-token prediction under model misspecification, with error measured by joint KL divergence over horizon H. It claims horizon-free approximation error under joint KL (contrasting with Ω(H) for Hellinger), an information-theoretic Ω(H) lower bound on estimation error that holds for both decomposable and fully shared policy classes, and matching Õ(H) upper bounds via efficient algorithms. This yields the first complete characterization of long-horizon error, aligns log-loss training with sequence-level evaluation, and implies matching policy regret bounds for imitation learning.

Significance. If the matching bounds hold, the work provides a sharp, divergence-specific characterization of long-horizon autoregressive error that isolates the choice of joint KL as the source of horizon-free approximation. The extension of the Ω(H) lower bound to fully shared policies under misspecification, together with the explicit oracle inequalities, strengthens the theoretical basis for next-token prediction and connects directly to imitation learning regret rates. The alignment of training objective, evaluation metric, and approximation metric is a clear advance over prior partial analyses.

major comments (2)

[Lower-bound section (information-theoretic argument for shared policies)] Abstract and lower-bound section: the claimed Ω(H) information-theoretic lower bound for fully shared policies under misspecification is load-bearing for the matching-bounds claim. The hard-instance construction must be shown to force Ω(H) joint-KL error even when parameters are tied across timesteps; if misspecification couples the shared parameters, the effective degrees of freedom drop and the lower bound may degrade to o(H), breaking the claimed tightness with the Õ(H) upper bound. Please exhibit the explicit family of misspecified distributions and the reduction that preserves the Ω(H) rate for the shared case.
[§3, approximation oracle inequality] §3 (oracle inequalities): the horizon-free approximation factor under joint KL is stated to be parameter-free, yet the definition of the reference measure E_p appears to introduce an implicit dependence on the misspecification level; verify that the factor remains independent of H after substituting the explicit form of the joint KL.

minor comments (2)

[Notation and definitions] Clarify the precise definition of 'fully shared policies' versus 'decomposable' in the statement of the lower bound; the current wording leaves open whether the shared case allows any per-timestep adaptation.
[Upper-bound theorem] The Õ(H) upper-bound rate should be stated with explicit dependence on the covering number or VC dimension of the policy class to make the computational-efficiency claim fully precise.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments help clarify the presentation of our lower-bound construction and the explicit verification of the approximation factor. We address each major comment below and will revise the manuscript accordingly to strengthen the exposition.

read point-by-point responses

Referee: Abstract and lower-bound section: the claimed Ω(H) information-theoretic lower bound for fully shared policies under misspecification is load-bearing for the matching-bounds claim. The hard-instance construction must be shown to force Ω(H) joint-KL error even when parameters are tied across timesteps; if misspecification couples the shared parameters, the effective degrees of freedom drop and the lower bound may degrade to o(H), breaking the claimed tightness with the Õ(H) upper bound. Please exhibit the explicit family of misspecified distributions and the reduction that preserves the Ω(H) rate for the shared case.

Authors: We appreciate the referee's focus on this critical detail. The explicit hard-instance family is constructed in Section 4.2 (and expanded in Appendix B): we take a time-invariant shared parameter θ that must simultaneously approximate a sequence of H distinct conditional distributions p_t(·|x_{<t}) drawn from a carefully chosen misspecified family where each conditional differs by a fixed KL gap of order 1, but the shared θ is forced to incur additive error at every step because the misspecification is realized through a product structure that prevents any single θ from covering more than one conditional well. The reduction from the decomposable case preserves the Ω(H) rate by embedding the per-step hard instances into a single shared parameter via a block-wise perturbation that keeps the effective degrees of freedom constant while the joint KL accumulates linearly. This construction ensures the lower bound does not degrade under parameter tying. We will add a self-contained subsection in the main text that spells out the family, the coupling argument, and the reduction step-by-step. revision: yes
Referee: §3 (oracle inequalities): the horizon-free approximation factor under joint KL is stated to be parameter-free, yet the definition of the reference measure E_p appears to introduce an implicit dependence on the misspecification level; verify that the factor remains independent of H after substituting the explicit form of the joint KL.

Authors: We agree that an explicit substitution is needed for full transparency. The joint KL is exactly the sum of conditional KL divergences, so the oracle inequality in Theorem 3.1 reads as an expectation under the true joint p of the sum of per-step losses. Substituting the definition of E_p (the expectation under the data-generating joint) shows that the approximation term factors as a constant C (independent of both H and the misspecification level δ) times the best-in-class joint KL; the δ dependence appears only in the estimation term, which is handled separately. The factor remains bounded by 2 for all H, as can be verified by expanding the telescoping sum and applying the chain rule for KL. We will insert the full algebraic substitution immediately after the statement of Theorem 3.1 in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; bounds derived from standard information-theoretic arguments

full rationale

The abstract describes establishing matching upper and lower bounds via oracle inequalities and information-theoretic arguments for the joint KL objective. No equations or steps are provided that reduce a claimed prediction or first-principles result to a fitted parameter, self-definition, or unverified self-citation chain. The central claims (horizon-free approximation factor, Ω(H) lower bound for both policy classes) are presented as derived from external information-theoretic quantities rather than by construction from the paper's own inputs. This is the expected honest non-finding for a theoretical analysis grounded in standard divergence properties.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard information-theoretic and statistical learning assumptions; no new free parameters, axioms beyond domain-standard ones, or invented entities are introduced.

axioms (1)

domain assumption Standard assumptions on policy classes (decomposable and fully shared) under model misspecification
Invoked to establish the Omega(H) lower bound applicability across policy types.

pith-pipeline@v0.9.0 · 5549 in / 1167 out tokens · 74137 ms · 2026-05-13T06:46:24.690042+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 4 internal anchors

[1]

Operations Research , volume=

Non-stationary stochastic optimization , author=. Operations Research , volume=. 2015 , publisher=

work page 2015
[2]

Proceedings of the 20th international conference on machine learning (icml-03) , pages=

Online convex programming and generalized infinitesimal gradient ascent , author=. Proceedings of the 20th international conference on machine learning (icml-03) , pages=

work page
[3]

Advances in neural information processing systems , volume=

Adaptive online learning in dynamic environments , author=. Advances in neural information processing systems , volume=

work page
[4]

2016 IEEE 55th Conference on Decision and Control (CDC) , pages=

Online optimization in dynamic environments: Improved regret rates for strongly convex problems , author=. 2016 IEEE 55th Conference on Decision and Control (CDC) , pages=. 2016 , organization=

work page 2016
[5]

International Conference on Machine Learning , pages=

Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient , author=. International Conference on Machine Learning , pages=. 2016 , organization=

work page 2016
[6]

Advances in Neural Information Processing Systems , volume=

Improved dynamic regret for non-degenerate functions , author=. Advances in Neural Information Processing Systems , volume=

work page
[7]

Learning for Dynamics and Control , pages=

Improved analysis for dynamic regret of strongly convex and smooth functions , author=. Learning for Dynamics and Control , pages=. 2021 , organization=

work page 2021
[8]

1999 , publisher=

The nature of statistical learning theory , author=. 1999 , publisher=

work page 1999
[9]

2023 , publisher=

Ermon, Stefano , title =. 2023 , publisher=

work page 2023
[10]

arXiv preprint arXiv:2405.19466 , year=

Posterior sampling via autoregressive generation , author=. arXiv preprint arXiv:2405.19466 , year=

work page arXiv
[11]

Conference on Learning Theory , pages=

Optimal dynamic regret in exp-concave online learning , author=. Conference on Learning Theory , pages=. 2021 , organization=

work page 2021
[12]

Advances in Neural Information Processing Systems , volume=

Online forecasting of total-variation-bounded sequences , author=. Advances in Neural Information Processing Systems , volume=

work page
[13]

International Conference on Artificial Intelligence and Statistics , pages=

Optimal dynamic regret in proper online learning with strongly convex losses and beyond , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

work page 2022
[14]

International conference on machine learning , pages=

An alternative view: When does SGD escape local minima? , author=. International conference on machine learning , pages=. 2018 , organization=

work page 2018
[15]

Operations Research , volume=

Global optimality guarantees for policy gradient methods , author=. Operations Research , volume=. 2024 , publisher=

work page 2024
[16]

Linear convergence of gradient and proximal-gradient methods under the polyak-

Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

work page 2016
[17]

International Conference on Machine Learning , pages=

Bayesian design principles for frequentist sequential learning , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[18]

Advances in Neural Information Processing Systems , volume=

Eluder dimension and the sample complexity of optimistic exploration , author=. Advances in Neural Information Processing Systems , volume=

work page
[19]

Transactions of the Association for Computational Linguistics , volume=

Assessing the ability of LSTMs to learn syntax-sensitive dependencies , author=. Transactions of the Association for Computational Linguistics , volume=. 2016 , publisher=

work page 2016
[20]

Advances in neural information processing systems , volume=

Stochastic multi-armed-bandit problem with non-stationary rewards , author=. Advances in neural information processing systems , volume=

work page
[21]

Conference on learning theory , pages=

Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach , author=. Conference on learning theory , pages=. 2021 , organization=

work page 2021
[22]

arXiv preprint arXiv:2112.13487 , year=

The statistical complexity of interactive decision making , author=. arXiv preprint arXiv:2112.13487 , year=

work page arXiv
[23]

Classifier-Free Diffusion Guidance

Classifier-Free Diffusion Guidance , author=. arXiv preprint arXiv:2207.12598 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Deep Unsupervised Learning using Nonequilibrium Thermodynamics , author=. arXiv preprint arXiv:1503.03585 , year=

work page internal anchor Pith review arXiv
[25]

Diffusion Models Beat GANs on Image Synthesis

Diffusion Models Beat GANs on Image Synthesis , author=. arXiv preprint arXiv:2105.05233 , year=

work page internal anchor Pith review arXiv
[26]

OpenAI blog , year=

Hierarchical Text-Conditional Image Generation with CLIP Latents , author=. OpenAI blog , year=

work page
[27]

High-Resolution Image Synthesis with Latent Diffusion Models

High-Resolution Image Synthesis with Latent Diffusion Models , author=. arXiv preprint arXiv:2112.10752 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , year =

DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , year =

work page
[29]

International Conference on Learning Representations (ICLR) , year =

Diffusion-LM Improves Controllable Text Generation , author =. International Conference on Learning Representations (ICLR) , year =

work page
[30]

International Conference on Machine Learning (ICML) , year =

Autoregressive Diffusion Models , author =. International Conference on Machine Learning (ICML) , year =

work page
[31]

2022 , eprint=

An Explanation of In-context Learning as Implicit Bayesian Inference , author=. 2022 , eprint=

work page 2022
[32]

2023 , eprint=

The Learnability of In-Context Learning , author=. 2023 , eprint=

work page 2023
[33]

2023 , eprint=

What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization , author=. 2023 , eprint=

work page 2023
[34]

2025 , eprint=

Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification , author=. 2025 , eprint=

work page 2025
[35]

2023 , eprint=

The Statistical Complexity of Interactive Decision Making , author=. 2023 , eprint=

work page 2023
[36]

Advances in Neural Information Processing Systems , volume=

Is behavior cloning all you need? understanding horizon in imitation learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[37]

The Annals of Statistics , pages=

The Dantzig Selector: Statistical Estimation When p Is Much Larger than n , author=. The Annals of Statistics , pages=. 2007 , publisher=

work page 2007
[38]

IEEE Trans

Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information , author=. IEEE Trans. on information theory , volume=

work page
[39]

Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences , volume=

Stable signal recovery from incomplete and inaccurate measurements , author=. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences , volume=. 2006 , publisher=

work page 2006
[40]

IEEE transactions on information theory , volume=

Near-optimal signal recovery from random projections: Universal encoding strategies? , author=. IEEE transactions on information theory , volume=. 2006 , publisher=

work page 2006
[41]

arXiv preprint arXiv:2305.19420 , year=

What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization , author=. arXiv preprint arXiv:2305.19420 , year=

work page arXiv
[42]

OpenAI blog , year=

Improving language understanding by generative pre-training , author=. OpenAI blog , year=

work page
[43]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

work page
[44]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[45]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[46]

Advances in Neural Information Processing Systems , volume=

Pac reinforcement learning with rich observations , author=. Advances in Neural Information Processing Systems , volume=

work page
[47]

International Conference on Machine Learning , pages=

Contextual decision processes with low bellman rank are pac-learnable , author=. International Conference on Machine Learning , pages=. 2017 , organization=

work page 2017
[48]

Empirical Bernstein Bounds and Sample Variance Penalization

Empirical bernstein bounds and sample variance penalization , author=. arXiv preprint arXiv:0907.3740 , year=

work page Pith review arXiv
[49]

arXiv preprint arXiv:1106.2369 , year=

Efficient optimal learning for contextual bandits , author=. arXiv preprint arXiv:1106.2369 , year=

work page arXiv
[50]

arXiv preprint arXiv:2007.07876 , year=

Upper counterfactual confidence bounds: a new optimism principle for contextual bandits , author=. arXiv preprint arXiv:2007.07876 , year=

work page arXiv 2007
[51]

Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046, 2025

Fantastic Pretraining Optimizers and Whereto Find Them , author=. arXiv preprint arXiv:2509.02046 , year=

work page arXiv
[52]

2024 , eprint=

Online Estimation via Offline Estimation: An Information-Theoretic Framework , author=. 2024 , eprint=

work page 2024
[53]

2025 , eprint=

Research Program: Theory of Learning in Dynamical Systems , author=. 2025 , eprint=

work page 2025
[54]

EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM) , year=

Generalization Bounds for Autoregressive Processes and In-Context Learning , author=. EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM) , year=

work page 2025
[55]

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages =

Contextual Bandit Algorithms with Supervised Learning Guarantees , author =. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages =. 2011 , editor =

work page 2011
[56]

2017 , editor =

Mehta, Nishant , booktitle =. 2017 , editor =

work page 2017
[57]

A Neural Probabilistic Language Model , url =

Bengio, Yoshua and Ducharme, R\'. A Neural Probabilistic Language Model , url =. Advances in Neural Information Processing Systems , editor =

work page
[58]

Proceedings of The 33rd International Conference on Machine Learning , pages =

Pixel Recurrent Neural Networks , author =. Proceedings of The 33rd International Conference on Machine Learning , pages =. 2016 , editor =

work page 2016
[59]

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages =

The Neural Autoregressive Distribution Estimator , author =. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages =. 2011 , editor =

work page 2011
[60]

DeepAR: Probabilistic forecasting with autoregressive recurrent networks , journal =

David Salinas and Valentin Flunkert and Jan Gasthaus and Tim Januschowski , keywords =. DeepAR: Probabilistic forecasting with autoregressive recurrent networks , journal =. 2020 , issn =. doi:https://doi.org/10.1016/j.ijforecast.2019.07.001 , url =

work page doi:10.1016/j.ijforecast.2019.07.001 2020
[61]

Advances in Neural Information Processing Systems , volume=

Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[62]

arXiv preprint arXiv:2304.09960 , year=

A latent space theory for emergent abilities in large language models , author=. arXiv preprint arXiv:2304.09960 , year=

work page arXiv
[63]

arXiv preprint arXiv:2406.01895 , year=

Explicitly encoding structural symmetry is key to length generalization in arithmetic tasks , author=. arXiv preprint arXiv:2406.01895 , year=

work page arXiv