A Single Stepsize Suffices for Unprojected Linear TD(0): Simultaneous Robust and Fast Rates via Polyak--Ruppert Averaging
Pith reviewed 2026-06-26 00:37 UTC · model grok-4.3
The pith
A single stepsize schedule depending only on mixing time suffices for unprojected linear TD(0) with Polyak-Ruppert averaging to remain bounded and converge at the minimum of robust and fast rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With the stepsize schedule η_t proportional to 1 over (τ_mix log(t) square-root t), the unprojected TD(0) iterates stay uniformly bounded with high probability on a single trajectory; the Polyak-Ruppert average then converges at the minimum of the curvature-free rate Õ(τ_mix over square-root T) and the curvature-dependent rate Õ(τ_mix squared over (ω T)).
What carries the argument
The Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes the Markov noise into a martingale term plus a controlled remainder and supports a self-bounding inductive argument for pathwise stability.
If this is right
- No projection step is required for stability.
- The curvature parameter ω need not be known in advance.
- The same stepsize simultaneously delivers both the robust and the fast rate, taking the better of the two.
- All guarantees are high-probability and hold uniformly over the trajectory.
Where Pith is reading between the lines
- The self-bounding technique may transfer to other stochastic approximation schemes that use dependent samples.
- Practical implementations could drop projection operators entirely when the mixing time is known or estimated.
- The result suggests testing whether similar single-stepsize rules work for nonlinear or deep TD variants.
Load-bearing premise
The data are generated by a geometrically mixing Markov chain.
What would settle it
A geometrically mixing chain on which the proposed stepsize produces iterates that escape any fixed ball with probability bounded away from zero, or on which the averaged error fails to meet the minimum of the two stated rates.
read the original abstract
We study linear TD(0) under Markovian sampling, where data are generated along a single trajectory. We provide high-probability guarantees for a plain unprojected TD(0) algorithm with Polyak-Ruppert (PR) averaging, using a single stepsize schedule $\eta_t \propto \frac{1}{\tau_{\mathrm{mix}}\log(t)\sqrt{t}}$ that depends on the mixing time but requires no prior knowledge of the curvature parameter $\omega$. Our first result shows that such a choice of the stepsize guarantees that the TD(0) iterates are automatically and uniformly bounded with high probability, without projections and without any stability argument based on $\omega$. Building on this result, we establish a simultaneous high-probability convergence guarantee for the PR average: the same stepsize yields both a robust curvature-free $\widetilde{\mathcal{O}}\!\left(\frac{\tau_{\mathrm{mix}}}{\sqrt{T}}\right)$ rate and a fast curvature-dependent $\widetilde{\mathcal{O}}\!\left(\frac{\tau_{\mathrm{mix}}^2}{\omega T}\right)$rate, with the bound taking the minimum of the two. The core technical ingredient is a Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes Markov noise into a martingale term plus a controlled remainder and enables a new self-bounding inductive argument for pathwise stability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for linear TD(0) under single-trajectory Markovian sampling, the stepsize schedule η_t ∝ 1/(τ_mix log(t) √t) (depending only on mixing time) ensures that the unprojected iterates remain uniformly bounded with high probability, without projections or any stability argument that uses the curvature parameter ω. Building on this, the Polyak-Ruppert averaged iterates simultaneously achieve a robust curvature-free high-probability rate Õ(τ_mix/√T) and a fast curvature-dependent rate Õ(τ_mix²/(ω T)), with the bound taken as the minimum of the two. The core device is a Poisson-equation decomposition that splits the Markov noise into a martingale plus a controlled remainder, enabling a new self-bounding inductive argument for pathwise stability.
Significance. If the derivations close, the result is significant: it removes the need for projections or knowledge of ω while delivering both robustness and fast rates from one schedule. The Poisson-equation toolkit and the self-bounding induction are explicit strengths that could be reusable. The work directly addresses a practical pain point in TD learning under Markovian data.
major comments (1)
- [self-bounding inductive argument for pathwise stability (core technical ingredient)] The central boundedness claim (abstract and the self-bounding argument) asserts that the induction closes for arbitrary ω>0 with a stepsize independent of ω. However, because the Poisson remainder scales with ||θ_t|| and the contraction weakens as ω→0, the choice of threshold M and the martingale tail bound must be shown to close uniformly in ω. Please provide the explicit dependence (or independence) of M on ω and the accumulated sum of η_t in the relevant lemma or proposition that establishes the high-probability event.
minor comments (2)
- [statement of main results] Clarify the precise definition of the Õ notation and whether the log(t) factor in η_t is absorbed into the Õ or appears explicitly in the final bounds.
- [assumptions] The geometric mixing assumption is stated as weakest; confirm that all constants in the rates are allowed to depend on the mixing parameters but not on ω.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the significance of the result and for the careful reading of the technical core. We address the single major comment below.
read point-by-point responses
-
Referee: The central boundedness claim (abstract and the self-bounding argument) asserts that the induction closes for arbitrary ω>0 with a stepsize independent of ω. However, because the Poisson remainder scales with ||θ_t|| and the contraction weakens as ω→0, the choice of threshold M and the martingale tail bound must be shown to close uniformly in ω. Please provide the explicit dependence (or independence) of M on ω and the accumulated sum of η_t in the relevant lemma or proposition that establishes the high-probability event.
Authors: We agree that an explicit statement of the ω-independence of M is useful for clarity. In the self-bounding argument (Proposition 4.3 and the supporting Lemma 4.2), M is chosen as M = 2||θ_0|| + C τ_mix (1 + log(1/δ)), where C is an absolute constant; this choice does not depend on ω. The Poisson remainder is controlled by the geometric mixing and appears as an additive term whose accumulated effect is bounded by O(τ_mix sum_{s=1}^t η_s), but the induction is closed by showing that the probability of a large deviation is controlled via a Freedman-type martingale inequality whose variance proxy is sum η_s^2 ≲ 1/(τ_mix^2 log^2 t) summed up to O(1/τ_mix^2). Because the step-size schedule is independent of ω and the contraction ω is never invoked in the boundedness proof (only the smallness of η_t prevents overshoot), the same M works uniformly for all ω > 0. We will add a short remark after Proposition 4.3 that records these explicit expressions for M and the accumulated sums, confirming uniformity in ω. revision: yes
Circularity Check
Derivation self-contained with no reduction to inputs by construction
full rationale
The provided abstract and outline describe a Poisson-equation decomposition of Markov noise into martingale plus remainder, followed by a self-bounding inductive argument for pathwise stability of unprojected TD(0) iterates under a stepsize depending only on τ_mix. The resulting high-probability bounds are stated as the minimum of a curvature-free Õ(τ_mix/√T) rate and a curvature-dependent Õ(τ_mix²/(ω T)) rate; both quantities are treated as external properties of the chain and operator rather than quantities fitted or defined from the iterates themselves. No equations, self-citations, or ansatzes are shown that would reduce the claimed rates or stability claim to a tautology or to a parameter estimated from the target result. The paper explicitly asserts the argument avoids any ω-based stability step, and the given text supplies no counter-evidence of circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The data-generating Markov chain is geometrically mixing.
Reference graph
Works this paper leans on
-
[1]
Conference on Learning Theory , pages=
Stochastic linear optimization never overfits with quadratically-bounded losses on general data , author=. Conference on Learning Theory , pages=. 2022 , organization=
2022
-
[2]
2023 , organization=
Ivgi, Maor and Hinder, Oliver and Carmon, Yair , booktitle=. 2023 , organization=
2023
-
[3]
Conference on Learning Theory , pages=
A finite time analysis of temporal difference learning with linear function approximation , author=. Conference on Learning Theory , pages=. 2018 , organization=
2018
-
[4]
International Conference on Machine Learning , pages=
Temporal difference learning as gradient splitting , author=. International Conference on Machine Learning , pages=. 2021 , organization=
2021
-
[5]
A Simple Finite-Time Analysis of
Mitra, Aritra , journal=. A Simple Finite-Time Analysis of. 2025 , volume=
2025
-
[6]
arXiv preprint arXiv:2402.13903 , year=
Dealing with unbounded gradients in stochastic saddle-point optimization , author=. arXiv preprint arXiv:2402.13903 , year=
-
[7]
International Conference on Machine Learning , pages=
Unconstrained online learning with unbounded losses , author=. International Conference on Machine Learning , pages=. 2023 , organization=
2023
-
[8]
Stochastic gradient descent under
Even, Mathieu , booktitle=. Stochastic gradient descent under. 2023 , organization=
2023
-
[9]
Black-box reductions for parameter-free online learning in
Cutkosky, Ashok and Orabona, Francesco , booktitle=. Black-box reductions for parameter-free online learning in. 2018 , organization=
2018
-
[10]
2026 , publisher =
Mannor, Shie and Mansour, Yishay and Tamar, Aviv , title =. 2026 , publisher =
2026
-
[11]
Machine Learning , volume=
Learning to predict by the methods of temporal differences , author=. Machine Learning , volume=. 1988 , publisher=
1988
-
[12]
Logarithmic
Diaconis, Persi and Saloff-Coste, Laurent , journal=. Logarithmic. 1996 , publisher=
1996
-
[13]
Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies
Approximate temporal difference learning is a gradient descent for reversible policies , author=. arXiv preprint arXiv:1805.00869 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Advances in Neural Information Processing Systems , volume=
Analysis of temporal-difference learning with function approximation , author=. Advances in Neural Information Processing Systems , volume=
-
[15]
Journal of Artificial Intelligence Research , volume=
Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity , author=. Journal of Artificial Intelligence Research , volume=
-
[16]
Finite-time error bounds for linear stochastic approximation and
Srikant, Rayadurgam and Ying, Lei , booktitle=. Finite-time error bounds for linear stochastic approximation and. 2019 , organization=
2019
-
[17]
Mastering the game of
Silver, David and Huang, Aja and Maddison, Chris J and Guez, Arthur and Sifre, Laurent and Van Den Driessche, George and Schrittwieser, Julian and Antonoglou, Ioannis and Panneershelvam, Veda and Lanctot, Marc and Dieleman, Sander and Grewe, Dominik and Nham, John and Kalchbrenner, Nal and Sutskever, Ilya and Lillicrap, Timothy and Leach, Madeleine and Ka...
2016
-
[18]
2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=
Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates , author=. 2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2017 , organization=
2017
-
[19]
Proceedings of the IEEE International Conference on Computer Vision , pages=
Deepdriving: Learning affordance for direct perception in autonomous driving , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=
-
[20]
Wiley Interdisciplinary Reviews: Computational Statistics , volume=
Stochastic approximation: a survey , author=. Wiley Interdisciplinary Reviews: Computational Statistics , volume=. 2010 , publisher=
2010
-
[21]
Finite sample analyses for
Dalal, Gal and Sz. Finite sample analyses for. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[22]
International Conference on Artificial Intelligence and Statistics , pages=
Linear stochastic approximation: How far does constant step-size and iterate averaging go? , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=
2018
-
[23]
Korda, Nathaniel and La, Prashanth , booktitle=. On. 2015 , organization=
2015
-
[24]
state-of-the-art
Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results , author=. 2017 , institution=
2017
-
[25]
arXiv preprint arXiv:2102.00236 , year=
Parameter-free Stochastic Optimization of Variationally Coherent Functions , author =. arXiv preprint arXiv:2102.00236 , year=
-
[26]
2017 , publisher=
Markov chains and mixing times , author=. 2017 , publisher=
2017
-
[27]
Journal of Machine Learning Research , volume =
Xiao, Lin , title =. Journal of Machine Learning Research , volume =. 2010 , pages =
2010
-
[28]
arXiv preprint arXiv:2505.21391 , year=
Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features , author=. arXiv preprint arXiv:2505.21391 , year=
-
[29]
A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with
Khodadadian, Sajad and Zubeldia, Martin , journal=. A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with
-
[30]
The Thirty Seventh Annual Conference on Learning Theory , pages=
Improved high-probability bounds for the temporal difference learning algorithm via exponential stability , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=
2024
-
[31]
SIAM Journal on Mathematics of Data Science , volume=
Accelerated and instance-optimal policy evaluation with linear function approximation , author=. SIAM Journal on Mathematics of Data Science , volume=. 2023 , publisher=
2023
-
[32]
International Conference on Artificial Intelligence and Statistics , pages=
Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=
2023
-
[33]
Automatica , volume=
Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning , author=. Automatica , volume=. 2022 , publisher=
2022
-
[34]
1998 , publisher=
Reinforcement Learning: An Introduction , author=. 1998 , publisher=
1998
-
[35]
SIAM Journal on Optimization , volume=
Robust stochastic approximation approach to stochastic programming , author=. SIAM Journal on Optimization , volume=. 2009 , publisher=
2009
-
[36]
Least squares regression with
Nagaraj, Dheeraj and Wu, Xian and Bresler, Guy and Jain, Prateek and Netrapalli, Praneeth , journal=. Least squares regression with
-
[37]
A Robust
Lee, Wei-Cheng and Orabona, Francesco , journal=. A Robust
-
[38]
Chernoff-type bound for finite
Lezaud, Pascal , journal=. Chernoff-type bound for finite. 1998 , publisher=
1998
-
[39]
2024 , url =
Harin Lee , title =. 2024 , url =
2024
-
[40]
Towards Weaker Variance Assumptions for Stochastic Optimization
Towards weaker variance assumptions for stochastic optimization , author=. arXiv preprint arXiv:2504.09951 , year=
work page internal anchor Pith review arXiv
-
[41]
arXiv preprint arXiv:2511.19937 , year=
Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization , author=. arXiv preprint arXiv:2511.19937 , year=
-
[42]
A Unified Stability Analysis of
Chang, Wei-Kai and Khanna, Rajiv , journal=. A Unified Stability Analysis of
-
[43]
Optimum bounds for the distributions of martingales in
Pinelis, Iosif , journal=. Optimum bounds for the distributions of martingales in. 1994 , publisher=
1994
-
[44]
The Annals of Probability , pages=
On tail probabilities for martingales , author=. The Annals of Probability , pages=. 1975 , publisher=
1975
-
[45]
Freedman's inequality for matrix martingales , author=
-
[46]
Advances in Neural Information Processing Systems , volume=
Statistical efficiency of distributional temporal difference learning , author=. Advances in Neural Information Processing Systems , volume=
-
[47]
IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=
Adaptive temporal difference learning with linear function approximation , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2021 , publisher=
2021
-
[48]
Advances in Neural Information Processing Systems , volume=
Finite-time analysis of adaptive temporal difference learning with deep neural networks , author=. Advances in Neural Information Processing Systems , volume=
-
[49]
Journal of Machine Learning Research , volume=
Adaptive subgradient methods for online learning and stochastic optimization , author=. Journal of Machine Learning Research , volume=
-
[50]
Adam: A Method for Stochastic Optimization
Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[51]
Adaptive Bound Optimization for Online Convex Optimization
Adaptive bound optimization for online convex optimization , author=. arXiv preprint arXiv:1002.4908 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[52]
International Conference on Machine Learning , pages=
A convergence theory for deep learning via over-parameterization , author=. International Conference on Machine Learning , pages=. 2019 , organization=
2019
-
[53]
Advances in Neural Information Processing Systems , volume=
High-probability bounds for non-convex stochastic optimization with heavy tails , author=. Advances in Neural Information Processing Systems , volume=
-
[54]
2012 , publisher=
Markov chains and stochastic stability , author=. 2012 , publisher=
2012
-
[55]
arXiv preprint arXiv:2603.02577 , year=
Towards Parameter-Free Temporal Difference Learning , author=. arXiv preprint arXiv:2603.02577 , year=
-
[56]
Mathematics of Operations Research , volume =
Durmus, Alain and Moulines, Eric and Naumov, Alexey and Samsonov, Sergey , title =. Mathematics of Operations Research , volume =
-
[57]
High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation , year=
Li, Gen and Wu, Weichen and Chi, Yuejie and Ma, Cong and Rinaldo, Alessandro and Wei, Yuting , journal=. High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation , year=
-
[58]
Machine Learning , volume =
Concentration Bounds for Temporal Difference Learning with Linear Function Approximation: The Case of Batch Data and Uniform Sampling , author =. Machine Learning , volume =
-
[59]
On Linear Stochastic Approximation: Fine-grained
Mou, Wenlong and Li, Chris Junchi and Wainwright, Martin J and Bartlett, Peter L and Jordan, Michael I , booktitle =. On Linear Stochastic Approximation: Fine-grained
-
[60]
Asymptotic and finite sample analysis of nonexpansive stochastic approximations with
Blaser, Ethan and Zhang, Shangtong , journal=. Asymptotic and finite sample analysis of nonexpansive stochastic approximations with
-
[61]
, title =
Chandak, Siddharth and Borkar, Vivek S. , title =. Stochastic Systems , doi =
-
[62]
, title =
Konda, Vijaymohan R. , title =. 2002 , address =
2002
-
[63]
Sample complexity of asynchronous
Li, Gen and Wei, Yuting and Chi, Yuejie and Gu, Yuantao and Chen, Yuxin , journal=. Sample complexity of asynchronous
-
[64]
Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient
Raj, Anant and Joulani, Pooria and Gyorgy, Andras and Szepesv. Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.