A Single Stepsize Suffices for Unprojected Linear TD(0): Simultaneous Robust and Fast Rates via Polyak--Ruppert Averaging

Francesco Orabona; Wei-Cheng Lee

arxiv: 2606.24981 · v1 · pith:ZDBMHHLAnew · submitted 2026-06-23 · 💻 cs.LG · math.OC· stat.ML

A Single Stepsize Suffices for Unprojected Linear TD(0): Simultaneous Robust and Fast Rates via Polyak--Ruppert Averaging

Wei-Cheng Lee , Francesco Orabona This is my paper

Pith reviewed 2026-06-26 00:37 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML

keywords linear TD(0)Polyak-Ruppert averagingMarkovian samplinghigh-probability boundsmixing timetemporal difference learningstochastic approximation

0 comments

The pith

A single stepsize schedule depending only on mixing time suffices for unprojected linear TD(0) with Polyak-Ruppert averaging to remain bounded and converge at the minimum of robust and fast rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that linear TD(0) under Markovian sampling can use one fixed stepsize rule that depends on the chain's mixing time but needs no knowledge of the curvature parameter. This rule makes the raw iterates bounded with high probability without any projection step or separate stability proof. The same rule then gives the Polyak-Ruppert average a high-probability error bound that automatically equals the smaller of a curvature-free robust rate and a curvature-dependent fast rate. A reader would care because it removes two common practical obstacles: choosing a projection radius and tuning a stepsize to an unknown curvature value.

Core claim

With the stepsize schedule η_t proportional to 1 over (τ_mix log(t) square-root t), the unprojected TD(0) iterates stay uniformly bounded with high probability on a single trajectory; the Polyak-Ruppert average then converges at the minimum of the curvature-free rate Õ(τ_mix over square-root T) and the curvature-dependent rate Õ(τ_mix squared over (ω T)).

What carries the argument

The Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes the Markov noise into a martingale term plus a controlled remainder and supports a self-bounding inductive argument for pathwise stability.

If this is right

No projection step is required for stability.
The curvature parameter ω need not be known in advance.
The same stepsize simultaneously delivers both the robust and the fast rate, taking the better of the two.
All guarantees are high-probability and hold uniformly over the trajectory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-bounding technique may transfer to other stochastic approximation schemes that use dependent samples.
Practical implementations could drop projection operators entirely when the mixing time is known or estimated.
The result suggests testing whether similar single-stepsize rules work for nonlinear or deep TD variants.

Load-bearing premise

The data are generated by a geometrically mixing Markov chain.

What would settle it

A geometrically mixing chain on which the proposed stepsize produces iterates that escape any fixed ball with probability bounded away from zero, or on which the averaged error fails to meet the minimum of the two stated rates.

read the original abstract

We study linear TD(0) under Markovian sampling, where data are generated along a single trajectory. We provide high-probability guarantees for a plain unprojected TD(0) algorithm with Polyak-Ruppert (PR) averaging, using a single stepsize schedule $\eta_t \propto \frac{1}{\tau_{\mathrm{mix}}\log(t)\sqrt{t}}$ that depends on the mixing time but requires no prior knowledge of the curvature parameter $\omega$. Our first result shows that such a choice of the stepsize guarantees that the TD(0) iterates are automatically and uniformly bounded with high probability, without projections and without any stability argument based on $\omega$. Building on this result, we establish a simultaneous high-probability convergence guarantee for the PR average: the same stepsize yields both a robust curvature-free $\widetilde{\mathcal{O}}\!\left(\frac{\tau_{\mathrm{mix}}}{\sqrt{T}}\right)$ rate and a fast curvature-dependent $\widetilde{\mathcal{O}}\!\left(\frac{\tau_{\mathrm{mix}}^2}{\omega T}\right)$rate, with the bound taking the minimum of the two. The core technical ingredient is a Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes Markov noise into a martingale term plus a controlled remainder and enables a new self-bounding inductive argument for pathwise stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Single stepsize auto-bounds unprojected TD(0) iterates and delivers the min of robust and fast rates via PR averaging, using Poisson decomposition.

read the letter

The main point is that a stepsize schedule depending only on mixing time keeps the raw TD(0) iterates bounded with high probability without projections or any ω-based argument, and the averaged iterate then gets both the curvature-free Õ(τ_mix / √T) rate and the faster Õ(τ_mix² / (ω T)) rate, with the bound taking the smaller one.

They achieve this by decomposing the Markov noise via the Poisson equation into a martingale term plus a controlled remainder, then running a self-bounding induction to control the pathwise norm. That removes the usual projection step and the need to know or tune for curvature, which is the practical simplification the abstract advertises. The simultaneous rates are a clean byproduct of the same schedule.

The soft spot is the induction itself. The stress-test note is right to flag that the threshold M and the tail bounds have to close without relying on ω-driven contraction. When ω is small the linear operator gives little help, so everything rests on the accumulated stepsize sum and the martingale concentration being strong enough to keep escape probability low. The abstract claims it works, but that step is where any gap would sit, and it would need a line-by-line check.

This is for people who work on high-probability analyses of TD methods under Markov sampling. If the induction holds it removes two common obstacles at once. I would send it to referees because the claim is specific, the technical device is standard but used for a new purpose, and the result would matter inside the subfield even if it needs revision on the stability argument.

Referee Report

1 major / 2 minor

Summary. The paper claims that for linear TD(0) under single-trajectory Markovian sampling, the stepsize schedule η_t ∝ 1/(τ_mix log(t) √t) (depending only on mixing time) ensures that the unprojected iterates remain uniformly bounded with high probability, without projections or any stability argument that uses the curvature parameter ω. Building on this, the Polyak-Ruppert averaged iterates simultaneously achieve a robust curvature-free high-probability rate Õ(τ_mix/√T) and a fast curvature-dependent rate Õ(τ_mix²/(ω T)), with the bound taken as the minimum of the two. The core device is a Poisson-equation decomposition that splits the Markov noise into a martingale plus a controlled remainder, enabling a new self-bounding inductive argument for pathwise stability.

Significance. If the derivations close, the result is significant: it removes the need for projections or knowledge of ω while delivering both robustness and fast rates from one schedule. The Poisson-equation toolkit and the self-bounding induction are explicit strengths that could be reusable. The work directly addresses a practical pain point in TD learning under Markovian data.

major comments (1)

[self-bounding inductive argument for pathwise stability (core technical ingredient)] The central boundedness claim (abstract and the self-bounding argument) asserts that the induction closes for arbitrary ω>0 with a stepsize independent of ω. However, because the Poisson remainder scales with ||θ_t|| and the contraction weakens as ω→0, the choice of threshold M and the martingale tail bound must be shown to close uniformly in ω. Please provide the explicit dependence (or independence) of M on ω and the accumulated sum of η_t in the relevant lemma or proposition that establishes the high-probability event.

minor comments (2)

[statement of main results] Clarify the precise definition of the Õ notation and whether the log(t) factor in η_t is absorbed into the Õ or appears explicitly in the final bounds.
[assumptions] The geometric mixing assumption is stated as weakest; confirm that all constants in the rates are allowed to depend on the mixing parameters but not on ω.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the significance of the result and for the careful reading of the technical core. We address the single major comment below.

read point-by-point responses

Referee: The central boundedness claim (abstract and the self-bounding argument) asserts that the induction closes for arbitrary ω>0 with a stepsize independent of ω. However, because the Poisson remainder scales with ||θ_t|| and the contraction weakens as ω→0, the choice of threshold M and the martingale tail bound must be shown to close uniformly in ω. Please provide the explicit dependence (or independence) of M on ω and the accumulated sum of η_t in the relevant lemma or proposition that establishes the high-probability event.

Authors: We agree that an explicit statement of the ω-independence of M is useful for clarity. In the self-bounding argument (Proposition 4.3 and the supporting Lemma 4.2), M is chosen as M = 2||θ_0|| + C τ_mix (1 + log(1/δ)), where C is an absolute constant; this choice does not depend on ω. The Poisson remainder is controlled by the geometric mixing and appears as an additive term whose accumulated effect is bounded by O(τ_mix sum_{s=1}^t η_s), but the induction is closed by showing that the probability of a large deviation is controlled via a Freedman-type martingale inequality whose variance proxy is sum η_s^2 ≲ 1/(τ_mix^2 log^2 t) summed up to O(1/τ_mix^2). Because the step-size schedule is independent of ω and the contraction ω is never invoked in the boundedness proof (only the smallness of η_t prevents overshoot), the same M works uniformly for all ω > 0. We will add a short remark after Proposition 4.3 that records these explicit expressions for M and the accumulated sums, confirming uniformity in ω. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained with no reduction to inputs by construction

full rationale

The provided abstract and outline describe a Poisson-equation decomposition of Markov noise into martingale plus remainder, followed by a self-bounding inductive argument for pathwise stability of unprojected TD(0) iterates under a stepsize depending only on τ_mix. The resulting high-probability bounds are stated as the minimum of a curvature-free Õ(τ_mix/√T) rate and a curvature-dependent Õ(τ_mix²/(ω T)) rate; both quantities are treated as external properties of the chain and operator rather than quantities fitted or defined from the iterates themselves. No equations, self-citations, or ansatzes are shown that would reduce the claimed rates or stability claim to a tautology or to a parameter estimated from the target result. The paper explicitly asserts the argument avoids any ω-based stability step, and the given text supplies no counter-evidence of circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the standard domain assumption of geometric mixing for the Markov chain; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption The data-generating Markov chain is geometrically mixing.
Required for the Poisson-equation decomposition of Markov noise into martingale plus controlled remainder.

pith-pipeline@v0.9.1-grok · 5796 in / 1295 out tokens · 54157 ms · 2026-06-26T00:37:35.914428+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 9 canonical work pages · 4 internal anchors

[1]

Conference on Learning Theory , pages=

Stochastic linear optimization never overfits with quadratically-bounded losses on general data , author=. Conference on Learning Theory , pages=. 2022 , organization=

2022
[2]

2023 , organization=

Ivgi, Maor and Hinder, Oliver and Carmon, Yair , booktitle=. 2023 , organization=

2023
[3]

Conference on Learning Theory , pages=

A finite time analysis of temporal difference learning with linear function approximation , author=. Conference on Learning Theory , pages=. 2018 , organization=

2018
[4]

International Conference on Machine Learning , pages=

Temporal difference learning as gradient splitting , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[5]

A Simple Finite-Time Analysis of

Mitra, Aritra , journal=. A Simple Finite-Time Analysis of. 2025 , volume=

2025
[6]

arXiv preprint arXiv:2402.13903 , year=

Dealing with unbounded gradients in stochastic saddle-point optimization , author=. arXiv preprint arXiv:2402.13903 , year=

work page arXiv
[7]

International Conference on Machine Learning , pages=

Unconstrained online learning with unbounded losses , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[8]

Stochastic gradient descent under

Even, Mathieu , booktitle=. Stochastic gradient descent under. 2023 , organization=

2023
[9]

Black-box reductions for parameter-free online learning in

Cutkosky, Ashok and Orabona, Francesco , booktitle=. Black-box reductions for parameter-free online learning in. 2018 , organization=

2018
[10]

2026 , publisher =

Mannor, Shie and Mansour, Yishay and Tamar, Aviv , title =. 2026 , publisher =

2026
[11]

Machine Learning , volume=

Learning to predict by the methods of temporal differences , author=. Machine Learning , volume=. 1988 , publisher=

1988
[12]

Logarithmic

Diaconis, Persi and Saloff-Coste, Laurent , journal=. Logarithmic. 1996 , publisher=

1996
[13]

Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies

Approximate temporal difference learning is a gradient descent for reversible policies , author=. arXiv preprint arXiv:1805.00869 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Advances in Neural Information Processing Systems , volume=

Analysis of temporal-difference learning with function approximation , author=. Advances in Neural Information Processing Systems , volume=
[15]

Journal of Artificial Intelligence Research , volume=

Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity , author=. Journal of Artificial Intelligence Research , volume=
[16]

Finite-time error bounds for linear stochastic approximation and

Srikant, Rayadurgam and Ying, Lei , booktitle=. Finite-time error bounds for linear stochastic approximation and. 2019 , organization=

2019
[17]

Mastering the game of

Silver, David and Huang, Aja and Maddison, Chris J and Guez, Arthur and Sifre, Laurent and Van Den Driessche, George and Schrittwieser, Julian and Antonoglou, Ioannis and Panneershelvam, Veda and Lanctot, Marc and Dieleman, Sander and Grewe, Dominik and Nham, John and Kalchbrenner, Nal and Sutskever, Ilya and Lillicrap, Timothy and Leach, Madeleine and Ka...

2016
[18]

2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates , author=. 2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2017 , organization=

2017
[19]

Proceedings of the IEEE International Conference on Computer Vision , pages=

Deepdriving: Learning affordance for direct perception in autonomous driving , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=
[20]

Wiley Interdisciplinary Reviews: Computational Statistics , volume=

Stochastic approximation: a survey , author=. Wiley Interdisciplinary Reviews: Computational Statistics , volume=. 2010 , publisher=

2010
[21]

Finite sample analyses for

Dalal, Gal and Sz. Finite sample analyses for. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[22]

International Conference on Artificial Intelligence and Statistics , pages=

Linear stochastic approximation: How far does constant step-size and iterate averaging go? , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

2018
[23]

Korda, Nathaniel and La, Prashanth , booktitle=. On. 2015 , organization=

2015
[24]

state-of-the-art

Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results , author=. 2017 , institution=

2017
[25]

arXiv preprint arXiv:2102.00236 , year=

Parameter-free Stochastic Optimization of Variationally Coherent Functions , author =. arXiv preprint arXiv:2102.00236 , year=

work page arXiv
[26]

2017 , publisher=

Markov chains and mixing times , author=. 2017 , publisher=

2017
[27]

Journal of Machine Learning Research , volume =

Xiao, Lin , title =. Journal of Machine Learning Research , volume =. 2010 , pages =

2010
[28]

arXiv preprint arXiv:2505.21391 , year=

Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features , author=. arXiv preprint arXiv:2505.21391 , year=

work page arXiv
[29]

A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with

Khodadadian, Sajad and Zubeldia, Martin , journal=. A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with
[30]

The Thirty Seventh Annual Conference on Learning Theory , pages=

Improved high-probability bounds for the temporal difference learning algorithm via exponential stability , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

2024
[31]

SIAM Journal on Mathematics of Data Science , volume=

Accelerated and instance-optimal policy evaluation with linear function approximation , author=. SIAM Journal on Mathematics of Data Science , volume=. 2023 , publisher=

2023
[32]

International Conference on Artificial Intelligence and Statistics , pages=

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=

2023
[33]

Automatica , volume=

Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning , author=. Automatica , volume=. 2022 , publisher=

2022
[34]

1998 , publisher=

Reinforcement Learning: An Introduction , author=. 1998 , publisher=

1998
[35]

SIAM Journal on Optimization , volume=

Robust stochastic approximation approach to stochastic programming , author=. SIAM Journal on Optimization , volume=. 2009 , publisher=

2009
[36]

Least squares regression with

Nagaraj, Dheeraj and Wu, Xian and Bresler, Guy and Jain, Prateek and Netrapalli, Praneeth , journal=. Least squares regression with
[37]

A Robust

Lee, Wei-Cheng and Orabona, Francesco , journal=. A Robust
[38]

Chernoff-type bound for finite

Lezaud, Pascal , journal=. Chernoff-type bound for finite. 1998 , publisher=

1998
[39]

2024 , url =

Harin Lee , title =. 2024 , url =

2024
[40]

Towards Weaker Variance Assumptions for Stochastic Optimization

Towards weaker variance assumptions for stochastic optimization , author=. arXiv preprint arXiv:2504.09951 , year=

work page internal anchor Pith review arXiv
[41]

arXiv preprint arXiv:2511.19937 , year=

Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization , author=. arXiv preprint arXiv:2511.19937 , year=

work page arXiv
[42]

A Unified Stability Analysis of

Chang, Wei-Kai and Khanna, Rajiv , journal=. A Unified Stability Analysis of
[43]

Optimum bounds for the distributions of martingales in

Pinelis, Iosif , journal=. Optimum bounds for the distributions of martingales in. 1994 , publisher=

1994
[44]

The Annals of Probability , pages=

On tail probabilities for martingales , author=. The Annals of Probability , pages=. 1975 , publisher=

1975
[45]

Freedman's inequality for matrix martingales , author=
[46]

Advances in Neural Information Processing Systems , volume=

Statistical efficiency of distributional temporal difference learning , author=. Advances in Neural Information Processing Systems , volume=
[47]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Adaptive temporal difference learning with linear function approximation , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2021 , publisher=

2021
[48]

Advances in Neural Information Processing Systems , volume=

Finite-time analysis of adaptive temporal difference learning with deep neural networks , author=. Advances in Neural Information Processing Systems , volume=
[49]

Journal of Machine Learning Research , volume=

Adaptive subgradient methods for online learning and stochastic optimization , author=. Journal of Machine Learning Research , volume=
[50]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Adaptive Bound Optimization for Online Convex Optimization

Adaptive bound optimization for online convex optimization , author=. arXiv preprint arXiv:1002.4908 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

International Conference on Machine Learning , pages=

A convergence theory for deep learning via over-parameterization , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019
[53]

Advances in Neural Information Processing Systems , volume=

High-probability bounds for non-convex stochastic optimization with heavy tails , author=. Advances in Neural Information Processing Systems , volume=
[54]

2012 , publisher=

Markov chains and stochastic stability , author=. 2012 , publisher=

2012
[55]

arXiv preprint arXiv:2603.02577 , year=

Towards Parameter-Free Temporal Difference Learning , author=. arXiv preprint arXiv:2603.02577 , year=

work page arXiv
[56]

Mathematics of Operations Research , volume =

Durmus, Alain and Moulines, Eric and Naumov, Alexey and Samsonov, Sergey , title =. Mathematics of Operations Research , volume =
[57]

High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation , year=

Li, Gen and Wu, Weichen and Chi, Yuejie and Ma, Cong and Rinaldo, Alessandro and Wei, Yuting , journal=. High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation , year=
[58]

Machine Learning , volume =

Concentration Bounds for Temporal Difference Learning with Linear Function Approximation: The Case of Batch Data and Uniform Sampling , author =. Machine Learning , volume =
[59]

On Linear Stochastic Approximation: Fine-grained

Mou, Wenlong and Li, Chris Junchi and Wainwright, Martin J and Bartlett, Peter L and Jordan, Michael I , booktitle =. On Linear Stochastic Approximation: Fine-grained
[60]

Asymptotic and finite sample analysis of nonexpansive stochastic approximations with

Blaser, Ethan and Zhang, Shangtong , journal=. Asymptotic and finite sample analysis of nonexpansive stochastic approximations with
[61]

, title =

Chandak, Siddharth and Borkar, Vivek S. , title =. Stochastic Systems , doi =
[62]

, title =

Konda, Vijaymohan R. , title =. 2002 , address =

2002
[63]

Sample complexity of asynchronous

Li, Gen and Wei, Yuting and Chi, Yuejie and Gu, Yuantao and Chen, Yuxin , journal=. Sample complexity of asynchronous
[64]

Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient

Raj, Anant and Joulani, Pooria and Gyorgy, Andras and Szepesv. Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

2022

[1] [1]

Conference on Learning Theory , pages=

Stochastic linear optimization never overfits with quadratically-bounded losses on general data , author=. Conference on Learning Theory , pages=. 2022 , organization=

2022

[2] [2]

2023 , organization=

Ivgi, Maor and Hinder, Oliver and Carmon, Yair , booktitle=. 2023 , organization=

2023

[3] [3]

Conference on Learning Theory , pages=

A finite time analysis of temporal difference learning with linear function approximation , author=. Conference on Learning Theory , pages=. 2018 , organization=

2018

[4] [4]

International Conference on Machine Learning , pages=

Temporal difference learning as gradient splitting , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021

[5] [5]

A Simple Finite-Time Analysis of

Mitra, Aritra , journal=. A Simple Finite-Time Analysis of. 2025 , volume=

2025

[6] [6]

arXiv preprint arXiv:2402.13903 , year=

Dealing with unbounded gradients in stochastic saddle-point optimization , author=. arXiv preprint arXiv:2402.13903 , year=

work page arXiv

[7] [7]

International Conference on Machine Learning , pages=

Unconstrained online learning with unbounded losses , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[8] [8]

Stochastic gradient descent under

Even, Mathieu , booktitle=. Stochastic gradient descent under. 2023 , organization=

2023

[9] [9]

Black-box reductions for parameter-free online learning in

Cutkosky, Ashok and Orabona, Francesco , booktitle=. Black-box reductions for parameter-free online learning in. 2018 , organization=

2018

[10] [10]

2026 , publisher =

Mannor, Shie and Mansour, Yishay and Tamar, Aviv , title =. 2026 , publisher =

2026

[11] [11]

Machine Learning , volume=

Learning to predict by the methods of temporal differences , author=. Machine Learning , volume=. 1988 , publisher=

1988

[12] [12]

Logarithmic

Diaconis, Persi and Saloff-Coste, Laurent , journal=. Logarithmic. 1996 , publisher=

1996

[13] [13]

Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies

Approximate temporal difference learning is a gradient descent for reversible policies , author=. arXiv preprint arXiv:1805.00869 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Advances in Neural Information Processing Systems , volume=

Analysis of temporal-difference learning with function approximation , author=. Advances in Neural Information Processing Systems , volume=

[15] [15]

Journal of Artificial Intelligence Research , volume=

Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity , author=. Journal of Artificial Intelligence Research , volume=

[16] [16]

Finite-time error bounds for linear stochastic approximation and

Srikant, Rayadurgam and Ying, Lei , booktitle=. Finite-time error bounds for linear stochastic approximation and. 2019 , organization=

2019

[17] [17]

Mastering the game of

Silver, David and Huang, Aja and Maddison, Chris J and Guez, Arthur and Sifre, Laurent and Van Den Driessche, George and Schrittwieser, Julian and Antonoglou, Ioannis and Panneershelvam, Veda and Lanctot, Marc and Dieleman, Sander and Grewe, Dominik and Nham, John and Kalchbrenner, Nal and Sutskever, Ilya and Lillicrap, Timothy and Leach, Madeleine and Ka...

2016

[18] [18]

2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates , author=. 2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2017 , organization=

2017

[19] [19]

Proceedings of the IEEE International Conference on Computer Vision , pages=

Deepdriving: Learning affordance for direct perception in autonomous driving , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=

[20] [20]

Wiley Interdisciplinary Reviews: Computational Statistics , volume=

Stochastic approximation: a survey , author=. Wiley Interdisciplinary Reviews: Computational Statistics , volume=. 2010 , publisher=

2010

[21] [21]

Finite sample analyses for

Dalal, Gal and Sz. Finite sample analyses for. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[22] [22]

International Conference on Artificial Intelligence and Statistics , pages=

Linear stochastic approximation: How far does constant step-size and iterate averaging go? , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

2018

[23] [23]

Korda, Nathaniel and La, Prashanth , booktitle=. On. 2015 , organization=

2015

[24] [24]

state-of-the-art

Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results , author=. 2017 , institution=

2017

[25] [25]

arXiv preprint arXiv:2102.00236 , year=

Parameter-free Stochastic Optimization of Variationally Coherent Functions , author =. arXiv preprint arXiv:2102.00236 , year=

work page arXiv

[26] [26]

2017 , publisher=

Markov chains and mixing times , author=. 2017 , publisher=

2017

[27] [27]

Journal of Machine Learning Research , volume =

Xiao, Lin , title =. Journal of Machine Learning Research , volume =. 2010 , pages =

2010

[28] [28]

arXiv preprint arXiv:2505.21391 , year=

Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features , author=. arXiv preprint arXiv:2505.21391 , year=

work page arXiv

[29] [29]

A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with

Khodadadian, Sajad and Zubeldia, Martin , journal=. A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with

[30] [30]

The Thirty Seventh Annual Conference on Learning Theory , pages=

Improved high-probability bounds for the temporal difference learning algorithm via exponential stability , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

2024

[31] [31]

SIAM Journal on Mathematics of Data Science , volume=

Accelerated and instance-optimal policy evaluation with linear function approximation , author=. SIAM Journal on Mathematics of Data Science , volume=. 2023 , publisher=

2023

[32] [32]

International Conference on Artificial Intelligence and Statistics , pages=

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=

2023

[33] [33]

Automatica , volume=

Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning , author=. Automatica , volume=. 2022 , publisher=

2022

[34] [34]

1998 , publisher=

Reinforcement Learning: An Introduction , author=. 1998 , publisher=

1998

[35] [35]

SIAM Journal on Optimization , volume=

Robust stochastic approximation approach to stochastic programming , author=. SIAM Journal on Optimization , volume=. 2009 , publisher=

2009

[36] [36]

Least squares regression with

Nagaraj, Dheeraj and Wu, Xian and Bresler, Guy and Jain, Prateek and Netrapalli, Praneeth , journal=. Least squares regression with

[37] [37]

A Robust

Lee, Wei-Cheng and Orabona, Francesco , journal=. A Robust

[38] [38]

Chernoff-type bound for finite

Lezaud, Pascal , journal=. Chernoff-type bound for finite. 1998 , publisher=

1998

[39] [39]

2024 , url =

Harin Lee , title =. 2024 , url =

2024

[40] [40]

Towards Weaker Variance Assumptions for Stochastic Optimization

Towards weaker variance assumptions for stochastic optimization , author=. arXiv preprint arXiv:2504.09951 , year=

work page internal anchor Pith review arXiv

[41] [41]

arXiv preprint arXiv:2511.19937 , year=

Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization , author=. arXiv preprint arXiv:2511.19937 , year=

work page arXiv

[42] [42]

A Unified Stability Analysis of

Chang, Wei-Kai and Khanna, Rajiv , journal=. A Unified Stability Analysis of

[43] [43]

Optimum bounds for the distributions of martingales in

Pinelis, Iosif , journal=. Optimum bounds for the distributions of martingales in. 1994 , publisher=

1994

[44] [44]

The Annals of Probability , pages=

On tail probabilities for martingales , author=. The Annals of Probability , pages=. 1975 , publisher=

1975

[45] [45]

Freedman's inequality for matrix martingales , author=

[46] [46]

Advances in Neural Information Processing Systems , volume=

Statistical efficiency of distributional temporal difference learning , author=. Advances in Neural Information Processing Systems , volume=

[47] [47]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Adaptive temporal difference learning with linear function approximation , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2021 , publisher=

2021

[48] [48]

Advances in Neural Information Processing Systems , volume=

Finite-time analysis of adaptive temporal difference learning with deep neural networks , author=. Advances in Neural Information Processing Systems , volume=

[49] [49]

Journal of Machine Learning Research , volume=

Adaptive subgradient methods for online learning and stochastic optimization , author=. Journal of Machine Learning Research , volume=

[50] [50]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

Adaptive Bound Optimization for Online Convex Optimization

Adaptive bound optimization for online convex optimization , author=. arXiv preprint arXiv:1002.4908 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[52] [52]

International Conference on Machine Learning , pages=

A convergence theory for deep learning via over-parameterization , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019

[53] [53]

Advances in Neural Information Processing Systems , volume=

High-probability bounds for non-convex stochastic optimization with heavy tails , author=. Advances in Neural Information Processing Systems , volume=

[54] [54]

2012 , publisher=

Markov chains and stochastic stability , author=. 2012 , publisher=

2012

[55] [55]

arXiv preprint arXiv:2603.02577 , year=

Towards Parameter-Free Temporal Difference Learning , author=. arXiv preprint arXiv:2603.02577 , year=

work page arXiv

[56] [56]

Mathematics of Operations Research , volume =

Durmus, Alain and Moulines, Eric and Naumov, Alexey and Samsonov, Sergey , title =. Mathematics of Operations Research , volume =

[57] [57]

High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation , year=

Li, Gen and Wu, Weichen and Chi, Yuejie and Ma, Cong and Rinaldo, Alessandro and Wei, Yuting , journal=. High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation , year=

[58] [58]

Machine Learning , volume =

Concentration Bounds for Temporal Difference Learning with Linear Function Approximation: The Case of Batch Data and Uniform Sampling , author =. Machine Learning , volume =

[59] [59]

On Linear Stochastic Approximation: Fine-grained

Mou, Wenlong and Li, Chris Junchi and Wainwright, Martin J and Bartlett, Peter L and Jordan, Michael I , booktitle =. On Linear Stochastic Approximation: Fine-grained

[60] [60]

Asymptotic and finite sample analysis of nonexpansive stochastic approximations with

Blaser, Ethan and Zhang, Shangtong , journal=. Asymptotic and finite sample analysis of nonexpansive stochastic approximations with

[61] [61]

, title =

Chandak, Siddharth and Borkar, Vivek S. , title =. Stochastic Systems , doi =

[62] [62]

, title =

Konda, Vijaymohan R. , title =. 2002 , address =

2002

[63] [63]

Sample complexity of asynchronous

Li, Gen and Wei, Yuting and Chi, Yuejie and Gu, Yuantao and Chen, Yuxin , journal=. Sample complexity of asynchronous

[64] [64]

Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient

Raj, Anant and Joulani, Pooria and Gyorgy, Andras and Szepesv. Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

2022