pith. sign in

arxiv: 2605.08390 · v1 · submitted 2026-05-08 · 💻 cs.LG

The Power of Second Order Methods for Sequence Preconditioning

Pith reviewed 2026-05-12 01:32 UTC · model grok-4.3

classification 💻 cs.LG
keywords universal sequence preconditioninglinear dynamical systemsregret boundsVovk-Azoury-Warmuth algorithmmarginally stable systemsonline learningChebyshev polynomials
0
0 comments X p. Extension

The pith

Pairing universal sequence preconditioning with the Vovk-Azoury-Warmuth algorithm delivers polylogarithmic regret for any marginally stable linear dynamical system.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Universal Sequence Preconditioning compresses any sequence generated by a linear dynamical system into one that needs exponentially less memory to predict accurately. This compression creates sequences with much larger diameters and gradients, which defeats standard first-order online learners. The Vovk-Azoury-Warmuth algorithm is shown to be a natural match because it exploits the shorter effective memory while staying stable under the larger scale. When the two are combined the resulting predictor achieves regret that grows only as the cube of the logarithm of the time horizon, even when the hidden transition matrix is asymmetric. The result extends to systems whose eigenvalues lie at constant distance from the unit circle by supplying new bounds on Chebyshev polynomials evaluated at complex arguments.

Core claim

For any marginally-stable linear dynamical system the algorithm that combines Universal Sequence Preconditioning with the Vovk-Azoury-Warmuth method attains regret O(log^3 T), and the same bound continues to hold when the hidden transition matrix is asymmetric; new complex-analytic bounds on Chebyshev polynomials further extend the technique to systems whose eigenvalues have constant complex arguments.

What carries the argument

Universal Sequence Preconditioning (USP), which maps a long-memory linear dynamical sequence to a preconditioned sequence whose effective memory length is exponentially shorter while increasing diameter and gradient norms by an exponential factor, paired with the Vovk-Azoury-Warmuth (VAW) second-order online learner that automatically adapts to the inflated scale.

If this is right

  • Regret becomes independent of the hidden dimension of the dynamical system and depends only polylogarithmically on the time horizon.
  • The same bound holds for systems whose hidden transition matrices are asymmetric.
  • The applicability of USP is no longer restricted to bounded real spectra; constant-complex-argument systems are now covered by the new Chebyshev bounds.
  • Second-order methods become the natural choice once preconditioning has compressed the memory length but inflated the scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preconditioning-plus-second-order pattern may yield dimension-free guarantees in other online prediction settings where first-order methods suffer from large condition numbers.
  • In control or forecasting applications the O(log^3 T) regret implies that long-horizon predictors can be maintained without the usual polynomial blow-up in state dimension.
  • The complex Chebyshev bounds open a route to handling lightly damped oscillatory linear systems without additional discretization or approximation steps.

Load-bearing premise

The observed sequence is generated exactly by a marginally stable linear dynamical system whose hidden transition matrix satisfies the stated spectral conditions.

What would settle it

An explicit counter-example sequence generated by a marginally stable linear system in which the USP+VAW predictor incurs regret that grows faster than any constant times log cubed of the horizon length.

Figures

Figures reproduced from arXiv: 2605.08390 by Annie Marsden, Elad Hazan.

Figure 1
Figure 1. Figure 1: Normalized prediction error versus degree of preconditioning polynomial for (a) [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Preconditioning empirically reduces the norm of the signal. If this is provable, learning the [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
read the original abstract

Sequence prediction methods for dynamical systems with long memory, i.e. marginally stable systems, typically achieve regret that grows polynomially with the hidden dimension of the underlying generative model. Universal Sequence Preconditioning (USP) is a method that compresses any sequence which comes from a linear dynamical system into a "preconditioned" sequence which requires exponentially shorter memory for accurate prediction. However, the preconditioned sequence yields exponentially larger diameters and gradients, hindering USP from unlocking optimal regret bounds. Inspired by the minimum description length principle, we show that the Vovk-Azoury-Warmuth (VAW) algorithm is naturally matched to the USP regime. Indeed, it takes advantage of the memory compression while remaining robust to the exponential explosion of the diameter. We prove that combining USP with VAW achieves astoundingly strong results: for any marginally-stable linear dynamical system, this algorithm achieves polylogarithmic regret $O \left( \log^3 T \right)$ even in the presence of asymmetric hidden transition matrices. Finally, we extend the applicability of USP beyond bounded-spectrum systems by providing new complex-analytic bounds on Chebyshev polynomials, allowing for systems with constant complex arguments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Universal Sequence Preconditioning (USP) to compress sequences generated by linear dynamical systems into preconditioned sequences with exponentially shorter memory requirements. It shows that combining USP with the Vovk-Azoury-Warmuth (VAW) algorithm yields O(log^3 T) regret for any marginally stable linear dynamical system, including those with asymmetric hidden transition matrices, by leveraging new complex-analytic bounds on Chebyshev polynomials that extend applicability to constant complex arguments.

Significance. If the central claims hold, this would constitute a significant advance in online prediction for long-memory dynamical systems by achieving polylogarithmic regret independent of hidden dimension. The matching of VAW to the USP regime to control diameter explosion is a notable technical contribution, and the new Chebyshev bounds may have value beyond the regret application if they deliver the required exponential approximation rates.

major comments (2)
  1. [Section on new complex-analytic bounds on Chebyshev polynomials] The extension to asymmetric matrices and the O(log^3 T) regret both rest on the new complex-analytic bounds on Chebyshev polynomials delivering exponential approximation rates for fixed complex z with |z|=1 (constant-complex-argument case). The manuscript must explicitly state the dependence of the required polynomial degree on the non-normality measure or eigenvalue distribution; if the degree grows with these quantities, the effective memory length after USP will not remain polylogarithmic and the regret claim fails.
  2. [Regret bound derivation] The proof that USP+VAW achieves O(log^3 T) regret must explicitly track how the exponential diameter and gradient explosion after preconditioning is absorbed by VAW without introducing dimension-dependent or super-logarithmic factors. The current presentation leaves the error analysis and explicit assumptions on the generative process (exact linear dynamical system with marginally stable transition matrix) insufficiently detailed to verify the bound.
minor comments (2)
  1. The abstract uses the phrase 'astoundingly strong results'; replace with a more neutral description of the claimed bound.
  2. [Preliminaries] Define 'marginally stable' explicitly in terms of the spectral radius or eigenvalue locations and clarify its relation to the unit-circle assumption used in the Chebyshev analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review. The comments identify key areas where greater explicitness will strengthen the manuscript. We respond to each major comment below and will revise accordingly.

read point-by-point responses
  1. Referee: [Section on new complex-analytic bounds on Chebyshev polynomials] The extension to asymmetric matrices and the O(log^3 T) regret both rest on the new complex-analytic bounds on Chebyshev polynomials delivering exponential approximation rates for fixed complex z with |z|=1 (constant-complex-argument case). The manuscript must explicitly state the dependence of the required polynomial degree on the non-normality measure or eigenvalue distribution; if the degree grows with these quantities, the effective memory length after USP will not remain polylogarithmic and the regret claim fails.

    Authors: We appreciate this observation. Our new complex-analytic bounds are derived precisely to achieve exponential approximation rates for constant complex arguments with |z|=1, thereby extending USP to asymmetric marginally stable matrices. The required polynomial degree depends on the non-normality measure and eigenvalue distribution only through logarithmic factors that are absorbed into the overall O(log^3 T) regret; it does not grow with hidden dimension or produce super-polylog memory length. In the revision we will add an explicit lemma stating this dependence (including the precise polylogarithmic bound on degree) so that the memory and regret claims can be verified directly. revision: yes

  2. Referee: [Regret bound derivation] The proof that USP+VAW achieves O(log^3 T) regret must explicitly track how the exponential diameter and gradient explosion after preconditioning is absorbed by VAW without introducing dimension-dependent or super-logarithmic factors. The current presentation leaves the error analysis and explicit assumptions on the generative process (exact linear dynamical system with marginally stable transition matrix) insufficiently detailed to verify the bound.

    Authors: We agree the current presentation of the regret analysis would benefit from additional detail. In the revised manuscript we will expand the proof to explicitly track the post-preconditioning diameter and gradient norms, showing step-by-step that VAW's logarithmic dependence on these quantities absorbs the exponential blow-up into the O(log^3 T) term with no hidden dimension dependence or super-logarithmic factors in T. We will also state the generative assumptions (exact linear dynamical system with marginally stable transition matrix) at the outset of the analysis section and include the relevant intermediate bounds on preconditioned quantities. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent analysis and new bounds

full rationale

The abstract presents USP as a compression method for LDS sequences, matches it to VAW for robustness to diameter growth, and derives the O(log^3 T) regret bound from their combination. The extension to asymmetric matrices rests on newly provided complex-analytic bounds for Chebyshev polynomials at constant complex arguments. No quoted step reduces a prediction or central claim to a fitted parameter, self-definition, or load-bearing self-citation by construction. The regret result follows from algorithmic analysis rather than being presupposed in the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on the domain assumption that the data-generating process is a linear dynamical system with marginally stable transitions, plus the ad-hoc analytic bounds on Chebyshev polynomials supplied for the complex case. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The underlying process is a linear dynamical system whose hidden transition matrix is marginally stable.
    This is the setting in which the polylog regret is claimed to hold.
  • ad hoc to paper New complex-analytic bounds on Chebyshev polynomials hold for systems with constant complex arguments.
    The extension of USP applicability depends on these bounds being valid.

pith-pipeline@v0.9.0 · 5498 in / 1374 out tokens · 38915 ms · 2026-05-12T01:32:24.094430+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    Relative loss bounds for on-line density estimation with the exponential family of distributions.Machine Learning, 43(3):211–246, 2001

    Katy S Azoury and Manfred K Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions.Machine Learning, 43(3):211–246, 2001

  2. [2]

    A new approach to learning linear dynamical systems

    Ainesh Bakshi, Allen Liu, Ankur Moitra, and Morris Yau. A new approach to learning linear dynamical systems. InProceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 335–348, 2023

  3. [3]

    Cambridge university press, 2006

    Nicolò Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006

  4. [4]

    Oxford University Press, 3rd edition, 1999

    Chi-Tsong Chen.Linear System Theory and Design. Oxford University Press, 3rd edition, 1999

  5. [5]

    Universal learning of nonlinear dynamics

    Evan Dogariu, Anand Brahmbhatt, and Elad Hazan. Universal learning of nonlinear dynamics. arXiv preprint arXiv:2508.11990, 2025

  6. [6]

    Learning partially observed linear dynamical systems from logarithmic number of samples

    Salar Fattahi. Learning partially observed linear dynamical systems from logarithmic number of samples. InProceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 ofProceedings of Machine Learning Research, pages 60–72, 2021

  7. [7]

    No-regret prediction in marginally stable systems

    Udaya Ghai, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. No-regret prediction in marginally stable systems. InProceedings of the 33rd Conference on Learning Theory, volume 125 ofProceedings of Machine Learning Research, pages 1714–1757, 2020

  8. [8]

    MIT press, 2007

    Peter D Grünwald.The minimum description length principle. MIT press, 2007

  9. [9]

    Now Publishers, Inc., 2016

    Elad Hazan.Introduction to online convex optimization, volume 2. Now Publishers, Inc., 2016

  10. [10]

    Logarithmic regret algorithms for online convex optimization.Machine Learning, 69(2-3):169–192, 2007

    Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization.Machine Learning, 69(2-3):169–192, 2007

  11. [11]

    Spectral filtering for general linear dynamical systems

    Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. InAdvances in Neural Information Processing Systems 31, pages 4639–4648, 2018

  12. [12]

    Learning linear dynamical systems via spectral filtering

    Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. InAdvances in Neural Information Processing Systems 30, pages 6702–6712, 2017

  13. [13]

    Prentice-Hall, 1980

    Thomas Kailath.Linear Systems. Prentice-Hall, 1980

  14. [14]

    Improved rates for prediction and identification of partially observed linear dynamical systems

    Holden Lee. Improved rates for prediction and identification of partially observed linear dynamical systems. InProceedings of the 33rd International Conference on Algorithmic Learning Theory, volume 167 ofProceedings of Machine Learning Research, pages 668–698, 2022

  15. [15]

    Dimension-free regret for learning asymmetric linear dynamical systems.arXiv preprint arXiv:2502.06545, 2025

    Annie Marsden and Elad Hazan. Dimension-free regret for learning asymmetric linear dynamical systems.arXiv preprint arXiv:2502.06545, 2025

  16. [16]

    Universal sequence preconditioning

    Annie Marsden and Elad Hazan. Universal sequence preconditioning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  17. [17]

    Paria Rashidinejad, Jiantao Jiao, and Stuart J. Russell. Slip: Learning to predict in unknown dynamical systems with long-term memory. InAdvances in Neural Information Processing Systems 33, 2020

  18. [18]

    Modeling by shortest data description.Automatica, 14(5):465–471, 1978

    Jorma Rissanen. Modeling by shortest data description.Automatica, 14(5):465–471, 1978

  19. [19]

    Learning linear dynamical systems with semi-parametric least squares

    Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. InProceedings of the 32nd Conference on Learning Theory, volume 99 ofProceedings of Machine Learning Research, pages 2714–2802, 2019

  20. [20]

    Anastasios Tsiamis and George J. Pappas. Finite sample analysis of stochastic system identi- fication. In2019 IEEE 58th Conference on Decision and Control (CDC), pages 3648–3654. IEEE, 2019. 10

  21. [21]

    Anastasios Tsiamis and George J. Pappas. Online learning of the kalman filter with logarithmic regret.IEEE Transactions on Automatic Control, 68(5):2774–2789, 2023

  22. [22]

    Competitive on-line statistics.International Statistical Review, 69(2):213–248, 2001

    V olodya V ovk. Competitive on-line statistics.International Statistical Review, 69(2):213–248, 2001. 11 A Technical Proofs Lemma 2.For anyz∈C, if|Im(z)| ≤tthen|cos(z)| ≤cosh(t). Proof. Let z=x+iy where x, y∈R . We use the identity for the magnitude of the complex cosine function: |cos(x+iy)| 2 = cos2(x)+sinh2(y). Since cos2(x)≤1 we have |cos(z)| 2 ≤1+sin...