The Power of Second Order Methods for Sequence Preconditioning

arxiv: 2605.08390 · v1 · submitted 2026-05-08 · 💻 cs.LG

The Power of Second Order Methods for Sequence Preconditioning

Annie Marsden , Elad Hazan This is my paper

Pith reviewed 2026-05-12 01:32 UTC · model grok-4.3

classification 💻 cs.LG

keywords universal sequence preconditioninglinear dynamical systemsregret boundsVovk-Azoury-Warmuth algorithmmarginally stable systemsonline learningChebyshev polynomials

0 comments p. Extension

The pith

Pairing universal sequence preconditioning with the Vovk-Azoury-Warmuth algorithm delivers polylogarithmic regret for any marginally stable linear dynamical system.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Universal Sequence Preconditioning compresses any sequence generated by a linear dynamical system into one that needs exponentially less memory to predict accurately. This compression creates sequences with much larger diameters and gradients, which defeats standard first-order online learners. The Vovk-Azoury-Warmuth algorithm is shown to be a natural match because it exploits the shorter effective memory while staying stable under the larger scale. When the two are combined the resulting predictor achieves regret that grows only as the cube of the logarithm of the time horizon, even when the hidden transition matrix is asymmetric. The result extends to systems whose eigenvalues lie at constant distance from the unit circle by supplying new bounds on Chebyshev polynomials evaluated at complex arguments.

Core claim

For any marginally-stable linear dynamical system the algorithm that combines Universal Sequence Preconditioning with the Vovk-Azoury-Warmuth method attains regret O(log^3 T), and the same bound continues to hold when the hidden transition matrix is asymmetric; new complex-analytic bounds on Chebyshev polynomials further extend the technique to systems whose eigenvalues have constant complex arguments.

What carries the argument

Universal Sequence Preconditioning (USP), which maps a long-memory linear dynamical sequence to a preconditioned sequence whose effective memory length is exponentially shorter while increasing diameter and gradient norms by an exponential factor, paired with the Vovk-Azoury-Warmuth (VAW) second-order online learner that automatically adapts to the inflated scale.

If this is right

Regret becomes independent of the hidden dimension of the dynamical system and depends only polylogarithmically on the time horizon.
The same bound holds for systems whose hidden transition matrices are asymmetric.
The applicability of USP is no longer restricted to bounded real spectra; constant-complex-argument systems are now covered by the new Chebyshev bounds.
Second-order methods become the natural choice once preconditioning has compressed the memory length but inflated the scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same preconditioning-plus-second-order pattern may yield dimension-free guarantees in other online prediction settings where first-order methods suffer from large condition numbers.
In control or forecasting applications the O(log^3 T) regret implies that long-horizon predictors can be maintained without the usual polynomial blow-up in state dimension.
The complex Chebyshev bounds open a route to handling lightly damped oscillatory linear systems without additional discretization or approximation steps.

Load-bearing premise

The observed sequence is generated exactly by a marginally stable linear dynamical system whose hidden transition matrix satisfies the stated spectral conditions.

What would settle it

An explicit counter-example sequence generated by a marginally stable linear system in which the USP+VAW predictor incurs regret that grows faster than any constant times log cubed of the horizon length.

Figures

Figures reproduced from arXiv: 2605.08390 by Annie Marsden, Elad Hazan.

**Figure 2.** Figure 2: Preconditioning empirically reduces the norm of the signal. If this is provable, learning the [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

read the original abstract

Sequence prediction methods for dynamical systems with long memory, i.e. marginally stable systems, typically achieve regret that grows polynomially with the hidden dimension of the underlying generative model. Universal Sequence Preconditioning (USP) is a method that compresses any sequence which comes from a linear dynamical system into a "preconditioned" sequence which requires exponentially shorter memory for accurate prediction. However, the preconditioned sequence yields exponentially larger diameters and gradients, hindering USP from unlocking optimal regret bounds. Inspired by the minimum description length principle, we show that the Vovk-Azoury-Warmuth (VAW) algorithm is naturally matched to the USP regime. Indeed, it takes advantage of the memory compression while remaining robust to the exponential explosion of the diameter. We prove that combining USP with VAW achieves astoundingly strong results: for any marginally-stable linear dynamical system, this algorithm achieves polylogarithmic regret $O \left( \log^3 T \right)$ even in the presence of asymmetric hidden transition matrices. Finally, we extend the applicability of USP beyond bounded-spectrum systems by providing new complex-analytic bounds on Chebyshev polynomials, allowing for systems with constant complex arguments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims O(log^3 T) regret for any marginally stable linear dynamical system by pairing USP with VAW and adding new Chebyshev bounds for complex arguments, but the bound stands or falls on whether those approximations actually deliver exponential memory compression.

read the letter

The main takeaway is that combining universal sequence preconditioning with the Vovk-Azoury-Warmuth algorithm is supposed to give polylogarithmic regret even for asymmetric marginally stable systems. That would cut the usual polynomial dependence on hidden dimension down to log factors, which matters for long-memory predictors in online learning and control.

Referee Report

2 major / 2 minor

Summary. The paper introduces Universal Sequence Preconditioning (USP) to compress sequences generated by linear dynamical systems into preconditioned sequences with exponentially shorter memory requirements. It shows that combining USP with the Vovk-Azoury-Warmuth (VAW) algorithm yields O(log^3 T) regret for any marginally stable linear dynamical system, including those with asymmetric hidden transition matrices, by leveraging new complex-analytic bounds on Chebyshev polynomials that extend applicability to constant complex arguments.

Significance. If the central claims hold, this would constitute a significant advance in online prediction for long-memory dynamical systems by achieving polylogarithmic regret independent of hidden dimension. The matching of VAW to the USP regime to control diameter explosion is a notable technical contribution, and the new Chebyshev bounds may have value beyond the regret application if they deliver the required exponential approximation rates.

major comments (2)

[Section on new complex-analytic bounds on Chebyshev polynomials] The extension to asymmetric matrices and the O(log^3 T) regret both rest on the new complex-analytic bounds on Chebyshev polynomials delivering exponential approximation rates for fixed complex z with |z|=1 (constant-complex-argument case). The manuscript must explicitly state the dependence of the required polynomial degree on the non-normality measure or eigenvalue distribution; if the degree grows with these quantities, the effective memory length after USP will not remain polylogarithmic and the regret claim fails.
[Regret bound derivation] The proof that USP+VAW achieves O(log^3 T) regret must explicitly track how the exponential diameter and gradient explosion after preconditioning is absorbed by VAW without introducing dimension-dependent or super-logarithmic factors. The current presentation leaves the error analysis and explicit assumptions on the generative process (exact linear dynamical system with marginally stable transition matrix) insufficiently detailed to verify the bound.

minor comments (2)

The abstract uses the phrase 'astoundingly strong results'; replace with a more neutral description of the claimed bound.
[Preliminaries] Define 'marginally stable' explicitly in terms of the spectral radius or eigenvalue locations and clarify its relation to the unit-circle assumption used in the Chebyshev analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review. The comments identify key areas where greater explicitness will strengthen the manuscript. We respond to each major comment below and will revise accordingly.

read point-by-point responses

Referee: [Section on new complex-analytic bounds on Chebyshev polynomials] The extension to asymmetric matrices and the O(log^3 T) regret both rest on the new complex-analytic bounds on Chebyshev polynomials delivering exponential approximation rates for fixed complex z with |z|=1 (constant-complex-argument case). The manuscript must explicitly state the dependence of the required polynomial degree on the non-normality measure or eigenvalue distribution; if the degree grows with these quantities, the effective memory length after USP will not remain polylogarithmic and the regret claim fails.

Authors: We appreciate this observation. Our new complex-analytic bounds are derived precisely to achieve exponential approximation rates for constant complex arguments with |z|=1, thereby extending USP to asymmetric marginally stable matrices. The required polynomial degree depends on the non-normality measure and eigenvalue distribution only through logarithmic factors that are absorbed into the overall O(log^3 T) regret; it does not grow with hidden dimension or produce super-polylog memory length. In the revision we will add an explicit lemma stating this dependence (including the precise polylogarithmic bound on degree) so that the memory and regret claims can be verified directly. revision: yes
Referee: [Regret bound derivation] The proof that USP+VAW achieves O(log^3 T) regret must explicitly track how the exponential diameter and gradient explosion after preconditioning is absorbed by VAW without introducing dimension-dependent or super-logarithmic factors. The current presentation leaves the error analysis and explicit assumptions on the generative process (exact linear dynamical system with marginally stable transition matrix) insufficiently detailed to verify the bound.

Authors: We agree the current presentation of the regret analysis would benefit from additional detail. In the revised manuscript we will expand the proof to explicitly track the post-preconditioning diameter and gradient norms, showing step-by-step that VAW's logarithmic dependence on these quantities absorbs the exponential blow-up into the O(log^3 T) term with no hidden dimension dependence or super-logarithmic factors in T. We will also state the generative assumptions (exact linear dynamical system with marginally stable transition matrix) at the outset of the analysis section and include the relevant intermediate bounds on preconditioned quantities. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent analysis and new bounds

full rationale

The abstract presents USP as a compression method for LDS sequences, matches it to VAW for robustness to diameter growth, and derives the O(log^3 T) regret bound from their combination. The extension to asymmetric matrices rests on newly provided complex-analytic bounds for Chebyshev polynomials at constant complex arguments. No quoted step reduces a prediction or central claim to a fitted parameter, self-definition, or load-bearing self-citation by construction. The regret result follows from algorithmic analysis rather than being presupposed in the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on the domain assumption that the data-generating process is a linear dynamical system with marginally stable transitions, plus the ad-hoc analytic bounds on Chebyshev polynomials supplied for the complex case. No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption The underlying process is a linear dynamical system whose hidden transition matrix is marginally stable.
This is the setting in which the polylog regret is claimed to hold.
ad hoc to paper New complex-analytic bounds on Chebyshev polynomials hold for systems with constant complex arguments.
The extension of USP applicability depends on these bounds being valid.

pith-pipeline@v0.9.0 · 5498 in / 1374 out tokens · 38915 ms · 2026-05-12T01:32:24.094430+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the regret of the Vovk-Azoury-Warmuth (VAW) forecaster matches this balance: it grows linearly with the effective memory and only logarithmically with the optimal diameter.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1]

Relative loss bounds for on-line density estimation with the exponential family of distributions.Machine Learning, 43(3):211–246, 2001

Katy S Azoury and Manfred K Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions.Machine Learning, 43(3):211–246, 2001

work page 2001
[2]

A new approach to learning linear dynamical systems

Ainesh Bakshi, Allen Liu, Ankur Moitra, and Morris Yau. A new approach to learning linear dynamical systems. InProceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 335–348, 2023

work page 2023
[3]

Cambridge university press, 2006

Nicolò Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006

work page 2006
[4]

Oxford University Press, 3rd edition, 1999

Chi-Tsong Chen.Linear System Theory and Design. Oxford University Press, 3rd edition, 1999

work page 1999
[5]

Universal learning of nonlinear dynamics

Evan Dogariu, Anand Brahmbhatt, and Elad Hazan. Universal learning of nonlinear dynamics. arXiv preprint arXiv:2508.11990, 2025

work page arXiv 2025
[6]

Learning partially observed linear dynamical systems from logarithmic number of samples

Salar Fattahi. Learning partially observed linear dynamical systems from logarithmic number of samples. InProceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 ofProceedings of Machine Learning Research, pages 60–72, 2021

work page 2021
[7]

No-regret prediction in marginally stable systems

Udaya Ghai, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. No-regret prediction in marginally stable systems. InProceedings of the 33rd Conference on Learning Theory, volume 125 ofProceedings of Machine Learning Research, pages 1714–1757, 2020

work page 2020
[8]

MIT press, 2007

Peter D Grünwald.The minimum description length principle. MIT press, 2007

work page 2007
[9]

Now Publishers, Inc., 2016

Elad Hazan.Introduction to online convex optimization, volume 2. Now Publishers, Inc., 2016

work page 2016
[10]

Logarithmic regret algorithms for online convex optimization.Machine Learning, 69(2-3):169–192, 2007

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization.Machine Learning, 69(2-3):169–192, 2007

work page 2007
[11]

Spectral filtering for general linear dynamical systems

Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. InAdvances in Neural Information Processing Systems 31, pages 4639–4648, 2018

work page 2018
[12]

Learning linear dynamical systems via spectral filtering

Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. InAdvances in Neural Information Processing Systems 30, pages 6702–6712, 2017

work page 2017
[13]

Prentice-Hall, 1980

Thomas Kailath.Linear Systems. Prentice-Hall, 1980

work page 1980
[14]

Improved rates for prediction and identification of partially observed linear dynamical systems

Holden Lee. Improved rates for prediction and identification of partially observed linear dynamical systems. InProceedings of the 33rd International Conference on Algorithmic Learning Theory, volume 167 ofProceedings of Machine Learning Research, pages 668–698, 2022

work page 2022
[15]

Dimension-free regret for learning asymmetric linear dynamical systems.arXiv preprint arXiv:2502.06545, 2025

Annie Marsden and Elad Hazan. Dimension-free regret for learning asymmetric linear dynamical systems.arXiv preprint arXiv:2502.06545, 2025

work page arXiv 2025
[16]

Universal sequence preconditioning

Annie Marsden and Elad Hazan. Universal sequence preconditioning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[17]

Paria Rashidinejad, Jiantao Jiao, and Stuart J. Russell. Slip: Learning to predict in unknown dynamical systems with long-term memory. InAdvances in Neural Information Processing Systems 33, 2020

work page 2020
[18]

Modeling by shortest data description.Automatica, 14(5):465–471, 1978

Jorma Rissanen. Modeling by shortest data description.Automatica, 14(5):465–471, 1978

work page 1978
[19]

Learning linear dynamical systems with semi-parametric least squares

Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. InProceedings of the 32nd Conference on Learning Theory, volume 99 ofProceedings of Machine Learning Research, pages 2714–2802, 2019

work page 2019
[20]

Anastasios Tsiamis and George J. Pappas. Finite sample analysis of stochastic system identi- fication. In2019 IEEE 58th Conference on Decision and Control (CDC), pages 3648–3654. IEEE, 2019. 10

work page 2019
[21]

Anastasios Tsiamis and George J. Pappas. Online learning of the kalman filter with logarithmic regret.IEEE Transactions on Automatic Control, 68(5):2774–2789, 2023

work page 2023
[22]

Competitive on-line statistics.International Statistical Review, 69(2):213–248, 2001

V olodya V ovk. Competitive on-line statistics.International Statistical Review, 69(2):213–248, 2001. 11 A Technical Proofs Lemma 2.For anyz∈C, if|Im(z)| ≤tthen|cos(z)| ≤cosh(t). Proof. Let z=x+iy where x, y∈R . We use the identity for the magnitude of the complex cosine function: |cos(x+iy)| 2 = cos2(x)+sinh2(y). Since cos2(x)≤1 we have |cos(z)| 2 ≤1+sin...

work page 2001

[1] [1]

Relative loss bounds for on-line density estimation with the exponential family of distributions.Machine Learning, 43(3):211–246, 2001

Katy S Azoury and Manfred K Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions.Machine Learning, 43(3):211–246, 2001

work page 2001

[2] [2]

A new approach to learning linear dynamical systems

Ainesh Bakshi, Allen Liu, Ankur Moitra, and Morris Yau. A new approach to learning linear dynamical systems. InProceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 335–348, 2023

work page 2023

[3] [3]

Cambridge university press, 2006

Nicolò Cesa-Bianchi and Gábor Lugosi.Prediction, learning, and games. Cambridge university press, 2006

work page 2006

[4] [4]

Oxford University Press, 3rd edition, 1999

Chi-Tsong Chen.Linear System Theory and Design. Oxford University Press, 3rd edition, 1999

work page 1999

[5] [5]

Universal learning of nonlinear dynamics

Evan Dogariu, Anand Brahmbhatt, and Elad Hazan. Universal learning of nonlinear dynamics. arXiv preprint arXiv:2508.11990, 2025

work page arXiv 2025

[6] [6]

Learning partially observed linear dynamical systems from logarithmic number of samples

Salar Fattahi. Learning partially observed linear dynamical systems from logarithmic number of samples. InProceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 ofProceedings of Machine Learning Research, pages 60–72, 2021

work page 2021

[7] [7]

No-regret prediction in marginally stable systems

Udaya Ghai, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. No-regret prediction in marginally stable systems. InProceedings of the 33rd Conference on Learning Theory, volume 125 ofProceedings of Machine Learning Research, pages 1714–1757, 2020

work page 2020

[8] [8]

MIT press, 2007

Peter D Grünwald.The minimum description length principle. MIT press, 2007

work page 2007

[9] [9]

Now Publishers, Inc., 2016

Elad Hazan.Introduction to online convex optimization, volume 2. Now Publishers, Inc., 2016

work page 2016

[10] [10]

Logarithmic regret algorithms for online convex optimization.Machine Learning, 69(2-3):169–192, 2007

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization.Machine Learning, 69(2-3):169–192, 2007

work page 2007

[11] [11]

Spectral filtering for general linear dynamical systems

Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. InAdvances in Neural Information Processing Systems 31, pages 4639–4648, 2018

work page 2018

[12] [12]

Learning linear dynamical systems via spectral filtering

Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. InAdvances in Neural Information Processing Systems 30, pages 6702–6712, 2017

work page 2017

[13] [13]

Prentice-Hall, 1980

Thomas Kailath.Linear Systems. Prentice-Hall, 1980

work page 1980

[14] [14]

Improved rates for prediction and identification of partially observed linear dynamical systems

Holden Lee. Improved rates for prediction and identification of partially observed linear dynamical systems. InProceedings of the 33rd International Conference on Algorithmic Learning Theory, volume 167 ofProceedings of Machine Learning Research, pages 668–698, 2022

work page 2022

[15] [15]

Dimension-free regret for learning asymmetric linear dynamical systems.arXiv preprint arXiv:2502.06545, 2025

Annie Marsden and Elad Hazan. Dimension-free regret for learning asymmetric linear dynamical systems.arXiv preprint arXiv:2502.06545, 2025

work page arXiv 2025

[16] [16]

Universal sequence preconditioning

Annie Marsden and Elad Hazan. Universal sequence preconditioning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[17] [17]

Paria Rashidinejad, Jiantao Jiao, and Stuart J. Russell. Slip: Learning to predict in unknown dynamical systems with long-term memory. InAdvances in Neural Information Processing Systems 33, 2020

work page 2020

[18] [18]

Modeling by shortest data description.Automatica, 14(5):465–471, 1978

Jorma Rissanen. Modeling by shortest data description.Automatica, 14(5):465–471, 1978

work page 1978

[19] [19]

Learning linear dynamical systems with semi-parametric least squares

Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. InProceedings of the 32nd Conference on Learning Theory, volume 99 ofProceedings of Machine Learning Research, pages 2714–2802, 2019

work page 2019

[20] [20]

Anastasios Tsiamis and George J. Pappas. Finite sample analysis of stochastic system identi- fication. In2019 IEEE 58th Conference on Decision and Control (CDC), pages 3648–3654. IEEE, 2019. 10

work page 2019

[21] [21]

Anastasios Tsiamis and George J. Pappas. Online learning of the kalman filter with logarithmic regret.IEEE Transactions on Automatic Control, 68(5):2774–2789, 2023

work page 2023

[22] [22]

Competitive on-line statistics.International Statistical Review, 69(2):213–248, 2001

V olodya V ovk. Competitive on-line statistics.International Statistical Review, 69(2):213–248, 2001. 11 A Technical Proofs Lemma 2.For anyz∈C, if|Im(z)| ≤tthen|cos(z)| ≤cosh(t). Proof. Let z=x+iy where x, y∈R . We use the identity for the magnitude of the complex cosine function: |cos(x+iy)| 2 = cos2(x)+sinh2(y). Since cos2(x)≤1 we have |cos(z)| 2 ≤1+sin...

work page 2001