Non-Asymptotic Convergence of Stochastic Iterative Algorithms: A Lyapunov Framework

Siva Theja Maguluri; Zaiwei Chen

arxiv: 2605.31309 · v1 · pith:TFBKX346new · submitted 2026-05-29 · 💻 cs.LG · math.PR· stat.ML

Non-Asymptotic Convergence of Stochastic Iterative Algorithms: A Lyapunov Framework

Zaiwei Chen , Siva Theja Maguluri This is my paper

Pith reviewed 2026-06-28 23:09 UTC · model grok-4.3

classification 💻 cs.LG math.PRstat.ML

keywords stochastic approximationLyapunov functionsfinite-time analysisMoreau envelopesreinforcement learningQ-learningtemporal difference learning

0 comments

The pith

Generalized Moreau envelopes serve as universal Lyapunov functions for finite-time analysis of stochastic approximation algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys Lyapunov techniques for proving non-asymptotic convergence of stochastic iterative algorithms that solve fixed-point equations using noisy oracles. It demonstrates that generalized Moreau envelopes function as Lyapunov functions no matter which norm makes the operator contractive, and this yields mean-square error bounds. The same construction covers stochastic gradient descent, linear stochastic approximation, Q-learning, and temporal-difference learning. The survey also sketches how the approach extends to Markovian noise, seminorm contractive operators, dissipative operators, and high-probability statements.

Core claim

Generalized Moreau envelopes serve as universal Lyapunov functions, regardless of the underlying norm, and yield mean-square convergence guarantees for stochastic gradient descent, linear SA, and value-based reinforcement learning algorithms such as Q-learning and temporal-difference learning.

What carries the argument

generalized Moreau envelopes used as Lyapunov functions that certify contraction of the mean-field operator under additive noise

If this is right

Mean-square convergence rates follow directly for stochastic gradient descent on strongly convex problems.
The same rates apply to linear stochastic approximation and to value-based methods including Q-learning and TD learning.
The framework extends to Markovian noise and to operators that are contractive only in a seminorm.
High-probability bounds can be obtained by the same Lyapunov construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The universal character of the envelopes may simplify the derivation of rates for new variants of stochastic approximation that have not yet been analyzed.
The connection to dissipative operators suggests the technique could reach certain non-contractive but stable dynamical systems.
The open problems listed at the end indicate that high-probability and almost-sure statements remain less developed than mean-square ones.

Load-bearing premise

The fixed-point operator is contractive with respect to some norm and the noise is independent and identically distributed.

What would settle it

An explicit counter-example in which the operator fails to be contractive in every norm yet mean-square convergence still holds, or in which the operator is contractive but the Moreau-envelope Lyapunov function fails to decrease.

read the original abstract

We survey Lyapunov-based techniques for the finite-time analysis of stochastic iterative algorithms, also known as stochastic approximation (SA) algorithms, for solving fixed-point equations $\bar{F}(x)=x$, where the operator $\bar{F}(\cdot)$ can only be accessed through a noisy oracle. We first focus on the standard setting in which $\bar{F}(\cdot)$ is contractive with respect to some norm and the noise is i.i.d., and explain how generalized Moreau envelopes serve as universal Lyapunov functions, regardless of the underlying norm. We then show how this framework yields mean-square convergence guarantees and applies to stochastic gradient descent, linear SA, and value-based reinforcement learning algorithms such as Q-learning and temporal-difference learning. Finally, we discuss extensions to Markovian noise, seminorm-contractive operators, dissipative operators, and high-probability bounds, and conclude with open problems. The goal is to present a unified and self-contained roadmap for the finite-time analysis of SA and its applications, especially in reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a straightforward survey that unifies existing Lyapunov tools for stochastic approximation but adds no new theorems or derivations.

read the letter

The core takeaway is that this paper surveys Lyapunov-based finite-time analysis for stochastic approximation algorithms solving fixed-point equations. It highlights how generalized Moreau envelopes can serve as Lyapunov functions for contractive operators under i.i.d. noise, then applies the setup to SGD, linear SA, Q-learning, and TD learning before covering extensions like Markovian noise and high-probability bounds.

It does a decent job pulling the material into one self-contained roadmap. Readers who already know the individual results might still appreciate having the common structure laid out clearly, especially the way the framework adapts to different norms and the separation of standard cases from the extensions.

The main limitation is that the work is explicitly a survey with no new derivations or results claimed beyond the organization. The central claims rest on prior literature, so the contribution is presentational rather than technical. Soundness appears consistent within the stated scope of contractive operators and i.i.d. noise, but full verification would need the actual proof sketches and any error-bound details.

This is aimed at researchers in optimization and RL who want a consolidated reference for finite-time SA analysis rather than a source of fresh theorems. Someone already working in the area could get practical value from the unified view.

I would send it for peer review. A clear survey that organizes tools across SGD and RL can still be worth referee time even if the novelty is modest.

Referee Report

0 major / 3 minor

Summary. The manuscript surveys Lyapunov-based techniques for the finite-time analysis of stochastic iterative algorithms (stochastic approximation or SA) for solving fixed-point equations ¯{F}(x)=x accessed via noisy oracles. It focuses on the standard setting where ¯{F} is contractive w.r.t. some norm and noise is i.i.d., demonstrating that generalized Moreau envelopes serve as universal Lyapunov functions yielding mean-square convergence. Applications to stochastic gradient descent, linear SA, Q-learning, and temporal-difference learning are discussed, along with extensions to Markovian noise, seminorm-contractive operators, dissipative operators, high-probability bounds, and open problems. The aim is a unified, self-contained roadmap for such analyses, especially in reinforcement learning.

Significance. If the presented framework holds, the survey would be significant for providing a unified view of finite-time analysis techniques using generalized Moreau envelopes as Lyapunov functions across different norms and algorithms. This could facilitate the application of these methods in optimization and RL by offering a consistent approach rather than algorithm-specific analyses. The paper's structure as a survey that builds on existing literature while aiming for self-containment is a positive aspect.

minor comments (3)

Abstract: The term 'generalized Moreau envelopes' is introduced without a brief definition or pointer to its construction; adding one sentence would improve accessibility.
Introduction (first paragraph of survey description): The claim that the framework 'yields mean-square convergence guarantees' for the listed algorithms should include a forward reference to the specific section where the rates or conditions are stated, even if drawn from prior work.
Notation throughout: Ensure consistent use of the bar on F and the distinction between the contractive operator and its noisy version; minor inconsistencies here could confuse readers following the Lyapunov construction.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report does not raise any specific major comments.

Circularity Check

0 steps flagged

No significant circularity: survey of prior techniques

full rationale

The paper is explicitly framed as a survey that unifies existing Lyapunov techniques for stochastic approximation under contractive operators. The central claim—that generalized Moreau envelopes serve as universal Lyapunov functions for mean-square convergence in the standard i.i.d. noise setting—restates and organizes results from the literature rather than deriving new predictions from parameters fitted inside this manuscript. No equations reduce a claimed result to a self-defined fit, no load-bearing uniqueness theorem is imported solely via self-citation, and the applications to SGD, linear SA, Q-learning, and TD learning are standard fixed-point reformulations. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is a survey paper; the ledger records the domain assumptions stated in the abstract for the standard setting. No new free parameters or invented entities are introduced by the survey itself.

axioms (2)

domain assumption The operator ar{F} is contractive with respect to some norm
Stated explicitly as the standard setting in the abstract.
domain assumption Noise is i.i.d.
Stated explicitly for the standard setting in the abstract.

pith-pipeline@v0.9.1-grok · 5707 in / 1240 out tokens · 23959 ms · 2026-06-28T23:09:16.663871+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

126 extracted references · 14 canonical work pages · 5 internal anchors

[1]

P., and Borkar, V

Abounadi, J., Bertsekas, D. P., and Borkar, V. (2002). Stochastic approximation for nonexpan- sive maps: Application to q-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22

2002
[2]

M., Crump, T., and Far, B

Afsar, M. M., Crump, T., and Far, B. (2022). Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38

2022
[3]

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

Agrawal, S., Maguluri, S. T., and Zubeldia, M. (2026). Concentration of general stochastic approximation under heavy-tailed Markovian noise.Preprint Arxiv: 2605.20999

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine learning proceedings 1995, pages 30–37. Elsevier

1995
[5]

Banach, S. (1922). Sur les op´ erations dans les ensembles abstraits et leur application aux ´ equations int´ egrales.Fund. math, 3(1):133–181

1922
[6]

and Gupta, A

Bansal, N. and Gupta, A. (2019). Potential-function proofs for gradient methods.Theory of Computing, 15(1):1–32. 27

2019
[7]

and Laraki, R

Baudin, L. and Laraki, R. (2022). Fictitious play and best-response dynamics in identical interest and zero-sum stochastic games. InInternational Conference on Machine Learning, pages 1664–1690. PMLR

2022
[8]

(2017).First-Order Methods in Optimization

Beck, A. (2017).First-Order Methods in Optimization. SIAM

2017
[9]

Bellman, R. (1957). Dynamic programming.Press Princeton, New Jersey, 39

1957
[10]

Bena¨ ım, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions.SIAM Journal on Control and Optimization, 44(1):328–348

2005
[11]

Bena¨ ım, M., Hofbauer, J., and Sorin, S. (2006). Stochastic approximations and differential inclusions, part ii: Applications.Mathematics of Operations Research, 31(4):673–695

2006
[12]

(2012).Adaptive Algorithms and Stochastic Approximations, volume 22

Benveniste, A., M´ etivier, M., and Priouret, P. (2012).Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media

2012
[13]

Bertsekas, D. P. and Tsitsiklis, J. N. (1996).Neuro-Dynamic Programming. Athena Scientific

1996
[14]

and Zhang, S

Blaser, E. and Zhang, S. (2026). Asymptotic and finite sample analysis of nonexpansive stochastic approximations with markovian noise. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 19764–19772

2026
[15]

Border, K. C. (1985).Fixed-Point Theorems with Applications to Economics and Game Theory. Cambridge university press

1985
[16]

Borkar, V., Chen, S., Devraj, A., Kontoyiannis, I., and Meyn, S. (2025). The ode method for asymptotic statistics in stochastic approximation and reinforcement learning.The Annals of Applied Probability, 35(2):936–982

2025
[17]

Borkar, V. S. (2009).Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer

2009
[18]

Borkar, V. S. (2021). A concentration bound for contractive stochastic approximation.Systems & Control Letters, 153:104947

2021
[19]

Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approx- imation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469

2000
[20]

E., and Nocedal, J

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning.Siam Review, 60(2):223–311

2018
[21]

and Cominetti, R

Bravo, M. and Cominetti, R. (2024). Stochastic fixed-point iterations for nonexpansive maps: Convergence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219

2024
[22]

C., Gerencs´ er, B., Gerencs´ er, L., and R´ asonyi, M

Car` e, A., Cs´ aji, B. C., Gerencs´ er, B., Gerencs´ er, L., and R´ asonyi, M. (2026). Stochastic approximation in a markovian framework revisited: Lipschitz continuity of the poisson equation. Mathematics of Control, Signals, and Systems, pages 1–43

2026
[23]

S., and Dodhia, P

Chandak, S., Borkar, V. S., and Dodhia, P. (2022). Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430. 28

2022
[24]

U., and Bambos, N

Chandak, S., Haque, S. U., and Bambos, N. (2025). Finite-time bounds for two-time-scale stochastic approximation with arbitrary norm contractions and markovian noise. In2025 IEEE 64th Conference on Decision and Control (CDC), pages 6095–6101. IEEE

2025
[25]

Chen, Z., Clarke, J.-P., and Maguluri, S. T. (2023). Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101

2023
[26]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes.Advances in Neural Information Processing Systems, 33:8223–8234

2020
[27]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2021). Finite-sample analysis of off-policy td-learning via generalized bellman operators.Advances in Neural Information Processing Systems, 34:21440–21452

2021
[28]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov the- ory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4):1352–1367

2024
[29]

T., and Zubeldia, M

Chen, Z., Maguluri, S. T., and Zubeldia, M. (2025a). Concentration of contractive stochastic approximation: Additive and multiplicative noise.The Annals of Applied Probability, 35(2):1298– 1352
[30]

T., Clarke, J.-P., and Maguluri, S

Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., and Maguluri, S. T. (2022). Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Au- tomatica, 146:110623

2022
[31]

U., and Maguluri, S

Chen, Z., Zhang, S., Zhang, Z., Haque, S. U., and Maguluri, S. T. (2025b). A non-asymptotic theory of seminorm lyapunov stability: From deterministic to stochastic iterative algorithms. Preprint Arxiv:2502.14208

work page arXiv
[32]

Chung, K. L. (1954). On a stochastic approximation method.The Annals of Mathematical Statistics, pages 463–483

1954
[33]

Cutler, J., Diaz, M., and Drusvyatskiy, D. (2024). Stochastic approximation with decision- dependent distributions: Asymptotic normality and optimality.Journal of Machine Learning Research, 25(90):1–49

2024
[34]

Dalal, G., Thoppe, G., Sz¨ or´ enyi, B., and Mannor, S. (2018). Finite sample analysis of two- timescale stochastic approximation with applications to reinforcement learning. InConference On Learning Theory, pages 1199–1233. PMLR

2018
[35]

Davis, D., Drusvyatskiy, D., and Jiang, L. (2024). Asymptotic normality and optimality in nonsmooth stochastic approximation.The Annals of Statistics, 52(4):1485–1508

2024
[36]

Deb, R., Ganesh, S., and Bhatnagar, S. (2025). Multi-timescale stochastic approximation: Stability and convergence.Preprint Arxiv:2112.03515. 29

work page arXiv 2025
[37]

Doan, T. T. (2021). Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation.SIAM Journal on Control and Optimization, 59(4):2798–2819

2021
[38]

Doan, T. T. (2022a). Finite-time analysis of markov gradient descent.IEEE Transactions on Automatic Control, 68(4):2140–2153
[39]

Doan, T. T. (2022b). Nonlinear two-time-scale stochastic approximation: Convergence and finite-time performance.IEEE Transactions on Automatic Control, 68(8):4695–4705
[40]

(2018).Markov Chains, volume 4

Douc, R., Moulines, E., Priouret, P., and Soulier, P. (2018).Markov Chains, volume 4. Springer

2018
[41]

C., Agarwal, A., Johansson, M., and Jordan, M

Duchi, J. C., Agarwal, A., Johansson, M., and Jordan, M. I. (2012). Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578

2012
[42]

Durmus, A., Moulines, E., Naumov, A., and Samsonov, S. (2025). Finite-time high-probability bounds for polyak–ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964

2025
[43]

Durmus, A., Moulines, E., Naumov, A., Samsonov, S., Scaman, K., and Wai, H.-T. (2021). Tight high probability bounds for linear stochastic approximation with fixed stepsize.Advances in Neural Information Processing Systems, 34:30063–30074

2021
[44]

(2019).Probability: Theory and Examples, volume 49

Durrett, R. (2019).Probability: Theory and Examples, volume 49. Cambridge university press

2019
[45]

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. InInternational Conference on Machine Learning, pages 1407–1416

2018
[46]

and Mansour, Y

Even-Dar, E. and Mansour, Y. (2003). Learning rates forQ-learning.Journal of Machine Learning Research, 5(Dec):1–25

2003
[47]

Fabian, V. (1968). On asymptotic normality in stochastic approximation.The Annals of Mathematical Statistics, pages 1327–1332

1968
[48]

Fort, G. (2015). Central limit theorems for stochastic approximation with controlled markov chain dynamics.ESAIM: Probability and Statistics, 19:60–80

2015
[49]

Gheshlaghi Azar, M., Munos, R., and Kappen, H. J. (2013). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349

2013
[50]

Glynn, P. W. and Iglehart, D. L. (1989). Importance sampling for stochastic simulations. Management science, 35(11):1367–1392

1989
[51]

and Thoppe, G

Gopalan, A. and Thoppe, G. (2023). Demystifying approximate reinforcement learning with ϵ-greedy exploration: A differential inclusion view

2023
[52]

Gosavi, A. (2006). Boundedness of iterates inQ-learning.Systems & control letters, 55(4):347– 349. 30

2006
[53]

and Nemirovski, A

Guzm´ an, C. and Nemirovski, A. (2015). On lower complexity bounds for large-scale smooth convex optimization.Journal of Complexity, 31(1):1–14

2015
[54]

Haddad, W. M. and Chellaboina, V. (2011).Nonlinear Dynamical Systems and Control: A Lyapunov-Based Approach. Princeton University Press

2011
[55]

Haque, S. U. and Maguluri, S. T. (2025). Stochastic approximation with unbounded markovian noise: A general-purpose theorem. InProceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 ofProceedings of Machine Learning Research, pages 3718–3726. PMLR

2025
[56]

G., Stepleton, T., and Munos, R

Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016). Q(λ) with off- policy corrections. InInternational Conference on Algorithmic Learning Theory, pages 305–320. Springer

2016
[57]

Harvey, N. J. A., Liaw, C., and Randhawa, S. (2019). Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent.Preprint Arxiv:1909.00843

work page arXiv 2019
[58]

and Kale, S

Hazan, E. and Kale, S. (2014). Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization.Journal of Machine Learning Research, 15:2489–2512

2014
[59]

and Sandholm, W

Hofbauer, J. and Sandholm, W. H. (2002). On the global convergence of stochastic fictitious play.Econometrica, 70(6):2265–2294

2002
[60]

A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. (2019). A short note on concentration inequalities for random vectors with subgaussian norm.Preprint arXiv:1902.03736

work page internal anchor Pith review Pith/arXiv arXiv 2019
[61]

Kaledin, M., Moulines, E., Naumov, A., Tadic, V., and Wai, H.-T. (2020). Finite time anal- ysis of linear two-timescale stochastic approximation with markovian noise. InConference on Learning Theory, pages 2144–2203. PMLR

2020
[62]

Khalil, H. K. and Grizzle, J. W. (2002).Nonlinear Systems, volume 3. Prentice hall Upper Saddle River, NJ

2002
[63]

Khodadadian, S., Sharma, P., Joshi, G., and Maguluri, S. T. (2022). Federated reinforcement learning: Linear speedup under markovian sampling. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 10997–11057. PMLR

2022
[64]

Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. InAdvances in neural information processing systems, pages 1008–1014. Citeseer

2000
[65]

Konda, V. R. and Tsitsiklis, J. N. (2004). Convergence rate of linear two-time-scale stochastic approximation.The Annals of Applied Probability, 14(2):796–819

2004
[66]

Kushner, H. J. and Clark, D. S. (2012).Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media

2012
[67]

and Lefschetz, S

La Salle, J. and Lefschetz, S. (2012).Stability by Liapunov’s Direct Method with Applications, volume 4. Elsevier. 31

2012
[68]

and Szepesv´ ari, C

Lakshminarayanan, C. and Szepesv´ ari, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? InProceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 ofProceedings of Machine Learning Research, pages 1347–1355. PMLR

2018
[69]

(2020).First-Order and Stochastic Optimization Methods for Machine Learning

Lan, G. (2020).First-Order and Stochastic Optimization Methods for Machine Learning. Springer

2020
[70]

Lauand, C. K. and Meyn, S. (2024). Revisiting step-size assumptions in stochastic approxi- mation.Preprint Arxiv:2405.17834

work page arXiv 2024
[71]

Levin, D. A. and Peres, Y. (2017).Markov Chains and Mixing Times, volume 107. American Mathematical Soc

2017
[72]

Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40

2016
[73]

Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Tightening the dependence on horizon in the sample complexity of q-learning. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 6296–6306. PMLR

2021
[74]

Li, G., Cai, C., Chen, Y., Wei, Y., and Chi, Y. (2024a). Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236
[75]

Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020a). Breaking the sample size barrier in model-based reinforcement learning with a generative model.Advances in neural information processing systems, 33
[76]

Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020b). Sample complexity of asynchronous q- learning: Sharper analysis and variance reduction. InAdvances in Neural Information Processing Systems, volume 33, pages 7031–7043. Curran Associates, Inc
[77]

Li, G., Wu, W., Chi, Y., Ma, C., Rinaldo, A., and Wei, Y. (2024b). High-probability sample complexities for policy evaluation with linear function approximation.IEEE Transactions on Information Theory, 70(8):5969–5999
[78]

Li, Y., Lan, G., and Zhao, T. (2022). First-order policy optimization for robust markov decision process.Preprint Arxiv:2209.10579

work page arXiv 2022
[79]

Liu, X., Xie, Z., and Zhang, S. (2025). Linear q-learning does not diverge inl 2: Convergence rates to a bounded set. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 39871–39897. PMLR

2025
[80]

Ljung, L. (1977). Analysis of recursive stochastic algorithms.IEEE transactions on automatic control, 22(4):551–575

1977

Showing first 80 references.

[1] [1]

P., and Borkar, V

Abounadi, J., Bertsekas, D. P., and Borkar, V. (2002). Stochastic approximation for nonexpan- sive maps: Application to q-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22

2002

[2] [2]

M., Crump, T., and Far, B

Afsar, M. M., Crump, T., and Far, B. (2022). Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38

2022

[3] [3]

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

Agrawal, S., Maguluri, S. T., and Zubeldia, M. (2026). Concentration of general stochastic approximation under heavy-tailed Markovian noise.Preprint Arxiv: 2605.20999

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine learning proceedings 1995, pages 30–37. Elsevier

1995

[5] [5]

Banach, S. (1922). Sur les op´ erations dans les ensembles abstraits et leur application aux ´ equations int´ egrales.Fund. math, 3(1):133–181

1922

[6] [6]

and Gupta, A

Bansal, N. and Gupta, A. (2019). Potential-function proofs for gradient methods.Theory of Computing, 15(1):1–32. 27

2019

[7] [7]

and Laraki, R

Baudin, L. and Laraki, R. (2022). Fictitious play and best-response dynamics in identical interest and zero-sum stochastic games. InInternational Conference on Machine Learning, pages 1664–1690. PMLR

2022

[8] [8]

(2017).First-Order Methods in Optimization

Beck, A. (2017).First-Order Methods in Optimization. SIAM

2017

[9] [9]

Bellman, R. (1957). Dynamic programming.Press Princeton, New Jersey, 39

1957

[10] [10]

Bena¨ ım, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions.SIAM Journal on Control and Optimization, 44(1):328–348

2005

[11] [11]

Bena¨ ım, M., Hofbauer, J., and Sorin, S. (2006). Stochastic approximations and differential inclusions, part ii: Applications.Mathematics of Operations Research, 31(4):673–695

2006

[12] [12]

(2012).Adaptive Algorithms and Stochastic Approximations, volume 22

Benveniste, A., M´ etivier, M., and Priouret, P. (2012).Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media

2012

[13] [13]

Bertsekas, D. P. and Tsitsiklis, J. N. (1996).Neuro-Dynamic Programming. Athena Scientific

1996

[14] [14]

and Zhang, S

Blaser, E. and Zhang, S. (2026). Asymptotic and finite sample analysis of nonexpansive stochastic approximations with markovian noise. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 19764–19772

2026

[15] [15]

Border, K. C. (1985).Fixed-Point Theorems with Applications to Economics and Game Theory. Cambridge university press

1985

[16] [16]

Borkar, V., Chen, S., Devraj, A., Kontoyiannis, I., and Meyn, S. (2025). The ode method for asymptotic statistics in stochastic approximation and reinforcement learning.The Annals of Applied Probability, 35(2):936–982

2025

[17] [17]

Borkar, V. S. (2009).Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer

2009

[18] [18]

Borkar, V. S. (2021). A concentration bound for contractive stochastic approximation.Systems & Control Letters, 153:104947

2021

[19] [19]

Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approx- imation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469

2000

[20] [20]

E., and Nocedal, J

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning.Siam Review, 60(2):223–311

2018

[21] [21]

and Cominetti, R

Bravo, M. and Cominetti, R. (2024). Stochastic fixed-point iterations for nonexpansive maps: Convergence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219

2024

[22] [22]

C., Gerencs´ er, B., Gerencs´ er, L., and R´ asonyi, M

Car` e, A., Cs´ aji, B. C., Gerencs´ er, B., Gerencs´ er, L., and R´ asonyi, M. (2026). Stochastic approximation in a markovian framework revisited: Lipschitz continuity of the poisson equation. Mathematics of Control, Signals, and Systems, pages 1–43

2026

[23] [23]

S., and Dodhia, P

Chandak, S., Borkar, V. S., and Dodhia, P. (2022). Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430. 28

2022

[24] [24]

U., and Bambos, N

Chandak, S., Haque, S. U., and Bambos, N. (2025). Finite-time bounds for two-time-scale stochastic approximation with arbitrary norm contractions and markovian noise. In2025 IEEE 64th Conference on Decision and Control (CDC), pages 6095–6101. IEEE

2025

[25] [25]

Chen, Z., Clarke, J.-P., and Maguluri, S. T. (2023). Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101

2023

[26] [26]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes.Advances in Neural Information Processing Systems, 33:8223–8234

2020

[27] [27]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2021). Finite-sample analysis of off-policy td-learning via generalized bellman operators.Advances in Neural Information Processing Systems, 34:21440–21452

2021

[28] [28]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov the- ory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4):1352–1367

2024

[29] [29]

T., and Zubeldia, M

Chen, Z., Maguluri, S. T., and Zubeldia, M. (2025a). Concentration of contractive stochastic approximation: Additive and multiplicative noise.The Annals of Applied Probability, 35(2):1298– 1352

[30] [30]

T., Clarke, J.-P., and Maguluri, S

Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., and Maguluri, S. T. (2022). Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Au- tomatica, 146:110623

2022

[31] [31]

U., and Maguluri, S

Chen, Z., Zhang, S., Zhang, Z., Haque, S. U., and Maguluri, S. T. (2025b). A non-asymptotic theory of seminorm lyapunov stability: From deterministic to stochastic iterative algorithms. Preprint Arxiv:2502.14208

work page arXiv

[32] [32]

Chung, K. L. (1954). On a stochastic approximation method.The Annals of Mathematical Statistics, pages 463–483

1954

[33] [33]

Cutler, J., Diaz, M., and Drusvyatskiy, D. (2024). Stochastic approximation with decision- dependent distributions: Asymptotic normality and optimality.Journal of Machine Learning Research, 25(90):1–49

2024

[34] [34]

Dalal, G., Thoppe, G., Sz¨ or´ enyi, B., and Mannor, S. (2018). Finite sample analysis of two- timescale stochastic approximation with applications to reinforcement learning. InConference On Learning Theory, pages 1199–1233. PMLR

2018

[35] [35]

Davis, D., Drusvyatskiy, D., and Jiang, L. (2024). Asymptotic normality and optimality in nonsmooth stochastic approximation.The Annals of Statistics, 52(4):1485–1508

2024

[36] [36]

Deb, R., Ganesh, S., and Bhatnagar, S. (2025). Multi-timescale stochastic approximation: Stability and convergence.Preprint Arxiv:2112.03515. 29

work page arXiv 2025

[37] [37]

Doan, T. T. (2021). Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation.SIAM Journal on Control and Optimization, 59(4):2798–2819

2021

[38] [38]

Doan, T. T. (2022a). Finite-time analysis of markov gradient descent.IEEE Transactions on Automatic Control, 68(4):2140–2153

[39] [39]

Doan, T. T. (2022b). Nonlinear two-time-scale stochastic approximation: Convergence and finite-time performance.IEEE Transactions on Automatic Control, 68(8):4695–4705

[40] [40]

(2018).Markov Chains, volume 4

Douc, R., Moulines, E., Priouret, P., and Soulier, P. (2018).Markov Chains, volume 4. Springer

2018

[41] [41]

C., Agarwal, A., Johansson, M., and Jordan, M

Duchi, J. C., Agarwal, A., Johansson, M., and Jordan, M. I. (2012). Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578

2012

[42] [42]

Durmus, A., Moulines, E., Naumov, A., and Samsonov, S. (2025). Finite-time high-probability bounds for polyak–ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964

2025

[43] [43]

Durmus, A., Moulines, E., Naumov, A., Samsonov, S., Scaman, K., and Wai, H.-T. (2021). Tight high probability bounds for linear stochastic approximation with fixed stepsize.Advances in Neural Information Processing Systems, 34:30063–30074

2021

[44] [44]

(2019).Probability: Theory and Examples, volume 49

Durrett, R. (2019).Probability: Theory and Examples, volume 49. Cambridge university press

2019

[45] [45]

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. InInternational Conference on Machine Learning, pages 1407–1416

2018

[46] [46]

and Mansour, Y

Even-Dar, E. and Mansour, Y. (2003). Learning rates forQ-learning.Journal of Machine Learning Research, 5(Dec):1–25

2003

[47] [47]

Fabian, V. (1968). On asymptotic normality in stochastic approximation.The Annals of Mathematical Statistics, pages 1327–1332

1968

[48] [48]

Fort, G. (2015). Central limit theorems for stochastic approximation with controlled markov chain dynamics.ESAIM: Probability and Statistics, 19:60–80

2015

[49] [49]

Gheshlaghi Azar, M., Munos, R., and Kappen, H. J. (2013). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349

2013

[50] [50]

Glynn, P. W. and Iglehart, D. L. (1989). Importance sampling for stochastic simulations. Management science, 35(11):1367–1392

1989

[51] [51]

and Thoppe, G

Gopalan, A. and Thoppe, G. (2023). Demystifying approximate reinforcement learning with ϵ-greedy exploration: A differential inclusion view

2023

[52] [52]

Gosavi, A. (2006). Boundedness of iterates inQ-learning.Systems & control letters, 55(4):347– 349. 30

2006

[53] [53]

and Nemirovski, A

Guzm´ an, C. and Nemirovski, A. (2015). On lower complexity bounds for large-scale smooth convex optimization.Journal of Complexity, 31(1):1–14

2015

[54] [54]

Haddad, W. M. and Chellaboina, V. (2011).Nonlinear Dynamical Systems and Control: A Lyapunov-Based Approach. Princeton University Press

2011

[55] [55]

Haque, S. U. and Maguluri, S. T. (2025). Stochastic approximation with unbounded markovian noise: A general-purpose theorem. InProceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 ofProceedings of Machine Learning Research, pages 3718–3726. PMLR

2025

[56] [56]

G., Stepleton, T., and Munos, R

Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016). Q(λ) with off- policy corrections. InInternational Conference on Algorithmic Learning Theory, pages 305–320. Springer

2016

[57] [57]

Harvey, N. J. A., Liaw, C., and Randhawa, S. (2019). Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent.Preprint Arxiv:1909.00843

work page arXiv 2019

[58] [58]

and Kale, S

Hazan, E. and Kale, S. (2014). Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization.Journal of Machine Learning Research, 15:2489–2512

2014

[59] [59]

and Sandholm, W

Hofbauer, J. and Sandholm, W. H. (2002). On the global convergence of stochastic fictitious play.Econometrica, 70(6):2265–2294

2002

[60] [60]

A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. (2019). A short note on concentration inequalities for random vectors with subgaussian norm.Preprint arXiv:1902.03736

work page internal anchor Pith review Pith/arXiv arXiv 2019

[61] [61]

Kaledin, M., Moulines, E., Naumov, A., Tadic, V., and Wai, H.-T. (2020). Finite time anal- ysis of linear two-timescale stochastic approximation with markovian noise. InConference on Learning Theory, pages 2144–2203. PMLR

2020

[62] [62]

Khalil, H. K. and Grizzle, J. W. (2002).Nonlinear Systems, volume 3. Prentice hall Upper Saddle River, NJ

2002

[63] [63]

Khodadadian, S., Sharma, P., Joshi, G., and Maguluri, S. T. (2022). Federated reinforcement learning: Linear speedup under markovian sampling. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 10997–11057. PMLR

2022

[64] [64]

Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. InAdvances in neural information processing systems, pages 1008–1014. Citeseer

2000

[65] [65]

Konda, V. R. and Tsitsiklis, J. N. (2004). Convergence rate of linear two-time-scale stochastic approximation.The Annals of Applied Probability, 14(2):796–819

2004

[66] [66]

Kushner, H. J. and Clark, D. S. (2012).Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media

2012

[67] [67]

and Lefschetz, S

La Salle, J. and Lefschetz, S. (2012).Stability by Liapunov’s Direct Method with Applications, volume 4. Elsevier. 31

2012

[68] [68]

and Szepesv´ ari, C

Lakshminarayanan, C. and Szepesv´ ari, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? InProceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 ofProceedings of Machine Learning Research, pages 1347–1355. PMLR

2018

[69] [69]

(2020).First-Order and Stochastic Optimization Methods for Machine Learning

Lan, G. (2020).First-Order and Stochastic Optimization Methods for Machine Learning. Springer

2020

[70] [70]

Lauand, C. K. and Meyn, S. (2024). Revisiting step-size assumptions in stochastic approxi- mation.Preprint Arxiv:2405.17834

work page arXiv 2024

[71] [71]

Levin, D. A. and Peres, Y. (2017).Markov Chains and Mixing Times, volume 107. American Mathematical Soc

2017

[72] [72]

Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40

2016

[73] [73]

Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Tightening the dependence on horizon in the sample complexity of q-learning. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 6296–6306. PMLR

2021

[74] [74]

Li, G., Cai, C., Chen, Y., Wei, Y., and Chi, Y. (2024a). Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236

[75] [75]

Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020a). Breaking the sample size barrier in model-based reinforcement learning with a generative model.Advances in neural information processing systems, 33

[76] [76]

Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020b). Sample complexity of asynchronous q- learning: Sharper analysis and variance reduction. InAdvances in Neural Information Processing Systems, volume 33, pages 7031–7043. Curran Associates, Inc

[77] [77]

Li, G., Wu, W., Chi, Y., Ma, C., Rinaldo, A., and Wei, Y. (2024b). High-probability sample complexities for policy evaluation with linear function approximation.IEEE Transactions on Information Theory, 70(8):5969–5999

[78] [78]

Li, Y., Lan, G., and Zhao, T. (2022). First-order policy optimization for robust markov decision process.Preprint Arxiv:2209.10579

work page arXiv 2022

[79] [79]

Liu, X., Xie, Z., and Zhang, S. (2025). Linear q-learning does not diverge inl 2: Convergence rates to a bounded set. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 39871–39897. PMLR

2025

[80] [80]

Ljung, L. (1977). Analysis of recursive stochastic algorithms.IEEE transactions on automatic control, 22(4):551–575

1977