pith. machine review for the scientific record.

arxiv: 2605.07104 · v1 · submitted 2026-05-08 · 💻 cs.LG · math.OC · stat.ML

Recognition: 2 theorem links · Lean Theorem

Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 01:19 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords stochastic approximation · reinforcement learning · almost sure convergence · Markovian noise · Poisson equation · Moreau envelope · Lyapunov drift · Q-learning

The pith

A Poisson-Moreau drift establishes almost sure convergence rates for stochastic approximation under Markovian noise, approaching the near-optimal o(n^{-1}) for harmonic stepsizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a new analysis technique for stochastic approximation algorithms whose expected updates are contractive, a setting that covers many reinforcement learning methods such as Q-learning. By combining Moreau-envelope smoothing with a Poisson-equation correction that handles Markovian noise, the authors derive almost sure convergence rates arbitrarily close to o(n^{1-2η}) for power-law learning rates with exponent η between 1/2 and 1, and arbitrarily close to o(n^{-1}) for the harmonic rate. This matters because establishing almost sure rates under dependent noise has been difficult, the new rates come close to what is known to be optimal in the independent case, and the guarantees apply directly to practical algorithms running in sequential decision environments where samples are correlated.
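
For orientation, the recursion class in question can be written schematically as follows (a sketch in our notation; the paper's exact assumptions on norms and noise are richer):

```latex
% Contractive stochastic approximation driven by Markovian noise (schematic).
\[
  x_{n+1} \;=\; x_n + \alpha_n \bigl( F(x_n, Y_{n+1}) - x_n \bigr),
  \qquad \alpha_n = \Theta(n^{-\eta}),
\]
% where (Y_n) is an ergodic Markov chain with stationary law \pi, and the
% averaged operator \bar{F}(x) = E_{Y \sim \pi}[F(x, Y)] is a contraction:
\[
  \|\bar{F}(x) - \bar{F}(x')\| \;\le\; \gamma\,\|x - x'\|, \qquad \gamma < 1,
\]
% so \bar{F} has a unique fixed point x^* that the iterates should approach.
```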

Core claim

The central claim is that a Lyapunov drift constructed by applying a Poisson-equation-based correction for Markovian noise to the Moreau-envelope smoothing of the contractive mapping yields almost sure convergence rates arbitrarily close to o(n^{1-2η}) for stepsizes of order n^{-η} with η ∈ (1/2, 1), and arbitrarily close to o(n^{-1}) for harmonic stepsizes of order n^{-1}, the latter being near the law of the iterated logarithm bound.
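
Unpacked, "arbitrarily close to" quantifies over a slack parameter δ; schematically (with e_n the error in the paper's chosen metric, whether norm or squared norm being the paper's convention to fix):

```latex
\[
  \forall\,\delta > 0:\quad
  e_n \;=\; o\!\bigl(n^{\,1-2\eta+\delta}\bigr)\ \text{a.s.},
  \qquad \alpha_n = \Theta(n^{-\eta}),\ \eta \in (\tfrac12, 1),
\]
\[
  \forall\,\delta > 0:\quad
  e_n \;=\; o\!\bigl(n^{-1+\delta}\bigr)\ \text{a.s.},
  \qquad \alpha_n = \Theta(n^{-1}),
\]
% the harmonic case sitting just above the law-of-the-iterated-logarithm
% benchmark O(n^{-1} \log\log n) known for i.i.d. noise.
```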

What carries the argument

The Poisson-Moreau drift: a Lyapunov function that smooths the contractive mapping via its Moreau envelope and corrects the drift term using the solution to the Poisson equation for the Markov noise process.
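
The two ingredients are standard individually; schematically (our symbols, and the paper's generalized envelope may use non-Euclidean norms):

```latex
% Moreau envelope of a convex function f, smoothing parameter \varepsilon > 0:
\[
  M_\varepsilon f(x) \;=\; \min_{u} \Bigl\{ f(u) + \tfrac{1}{2\varepsilon}\|x - u\|^2 \Bigr\},
\]
% Poisson equation for the noise chain with transition kernel P, solved by v:
\[
  v(y) - (P v)(y) \;=\; g(y) - \pi(g).
\]
% The drift couples the two: the Lyapunov function is the Moreau-smoothed
% distance to the fixed point, corrected by v so the Markovian fluctuation
% telescopes into a martingale-difference term plus summable remainders.
```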

If this is right

  • The rates apply to common RL algorithms such as Q-learning and linear TD learning (a minimal Q-learning sketch follows this list).
  • For harmonic learning rates the almost sure rate is nearly optimal by the law of the iterated logarithm.
  • The analysis extends previous results that were limited to i.i.d. noise or to weaker rates.
  • Power-law learning rates converge faster than n^{1-2η+δ} for every δ > 0, that is, faster than any power strictly weaker than the claimed exponent.
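
A minimal tabular Q-learning loop of the kind those rates are claimed to cover; the toy MDP, uniform exploration, and global-count stepsize are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: 2 states, 2 actions, random transitions and rewards (illustrative).
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a dist over s'
R = rng.standard_normal((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
eta = 0.75  # power-law stepsize exponent in (1/2, 1)
s = 0
for n in range(1, 100_001):
    a = rng.integers(n_actions)               # behavior policy: uniform exploration
    s_next = rng.choice(n_states, p=P[s, a])  # Markovian sample: s' depends on s
    alpha = n ** (-eta)
    # Asynchronous Q-learning update; its expected update is the Bellman
    # optimality operator, a gamma-contraction in the sup norm, which is
    # exactly the contractivity assumption in the paper's setting.
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```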

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This drift construction might be adapted to other non-i.i.d. settings in optimization if a suitable Poisson solution exists.
  • Practical RL implementations could use these rates to set learning rate schedules with theoretical backing for almost sure behavior.
  • If the contractivity assumption holds in more general function approximation, the method could cover deeper RL algorithms.

Load-bearing premise

The expected updates must form a contractive mapping and the Markovian noise must admit a solution to the Poisson equation used in the drift correction.
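
For context, a standard sufficient condition for the second premise (a gloss, not something verified here for any specific chain): when the noise chain is uniformly ergodic and g is bounded, the fundamental solution below converges geometrically and solves the Poisson equation, supplying the boundedness and regularity constants the correction needs.

```latex
% Fundamental solution of the Poisson equation under uniform ergodicity:
\[
  v(y) \;=\; \sum_{k=0}^{\infty} \bigl( (P^k g)(y) - \pi(g) \bigr),
  \qquad v - P v \;=\; g - \pi(g).
\]
```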

What would settle it

A counterexample where a contractive stochastic approximation with Markov noise satisfying the Poisson equation fails to achieve the stated almost sure rate, or a direct computation on a simple case showing slower convergence.
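
The second test is cheap to run; a sketch that estimates the empirical decay exponent of a scalar contractive SA driven by a two-state Markov chain (entirely illustrative: the chain, operator, and crude log-log fit are our choices, and a single-trajectory fit only proxies the almost sure envelope):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-state Markov noise chain; stationary law pi = (2/3, 1/3) satisfies pi @ P == pi.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
b = np.array([1.0, -2.0])  # chosen so the stationary mean of b[y] is zero

# Noisy operator whose stationary average is the contraction F_bar(x) = 0.5 x,
# with fixed point x* = 0.
F = lambda x, y: 0.5 * x + b[y]

eta, N = 0.75, 10**6
x, y = 5.0, 0
errs = np.empty(N)
for n in range(1, N + 1):
    y = rng.choice(2, p=P[y])
    x += n ** (-eta) * (F(x, y) - x)
    errs[n - 1] = x * x

# Fit the tail decay of the squared error; the claimed a.s. rate predicts a
# slope at or below 1 - 2*eta = -0.5 (up to the delta slack and noise).
tail = np.arange(N // 10, N)
slope = np.polyfit(np.log(tail + 1.0), np.log(errs[tail] + 1e-300), 1)[0]
print(f"empirical exponent ~ {slope:.2f}, target 1 - 2*eta = {1 - 2*eta:.2f}")
```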

read the original abstract

Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are contractive, a setting that arises in many reinforcement learning algorithms such as $Q$-learning and linear temporal difference learning. Specifically, for a power-law learning rate $O(n^{-\eta})$ with $\eta \in (1/2, 1)$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{1 - 2\eta})$. For a harmonic learning rate $O(n^{-1})$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{-1})$, which we argue is a strong result because it is close to the optimal rate $O(n^{-1}\log\log n)$ given by the law of the iterated logarithm (for a special case of i.i.d. noise). Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation based correction for Markovian noise to the well-established Moreau-envelope smoothing for the contractive mapping.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript proposes a new Lyapunov drift technique called the Poisson-Moreau drift for analyzing almost sure convergence rates of stochastic approximation (SA) algorithms with contractive mean updates under Markovian noise. This setting is relevant to reinforcement learning methods such as Q-learning and linear temporal difference learning. For learning rates of the form O(n^{-η}) where η ∈ (1/2, 1), the paper derives an almost sure convergence rate that can be made arbitrarily close to o(n^{1-2η}). For the harmonic learning rate O(n^{-1}), the rate is arbitrarily close to o(n^{-1}), argued to be strong as it approaches the optimal rate from the law of the iterated logarithm in the i.i.d. noise case. The key innovation is applying a Poisson-equation based correction for the Markovian noise to the Moreau-envelope smoothed contractive mapping, leading to a drift inequality amenable to supermartingale arguments.

Significance. This work is significant because establishing almost sure rates for SA under Markovian noise has been challenging, and the results here are close to optimal. The Poisson-Moreau construction provides a systematic way to handle both the noise correlation and the contractivity, and is potentially applicable to other algorithms. Strengths include the explicit conditions for the existence and regularity of the Poisson solution, and the reduction to standard tools such as the Robbins-Siegmund lemma once the drift is constructed. If verified, it advances the theory of RL convergence analysis.
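
For reference, the drift-to-rate step leans on a classical result; the statement below is the standard Robbins-Siegmund lemma (the paper's exact variant may differ in bookkeeping):

```latex
% Robbins-Siegmund (1971): for nonnegative (F_n)-adapted V_n, U_n, a_n, b_n with
\[
  \mathbb{E}\bigl[\,V_{n+1} \mid \mathcal{F}_n\,\bigr]
  \;\le\; (1 + a_n)\,V_n \;-\; U_n \;+\; b_n,
  \qquad \textstyle\sum_n a_n < \infty,\ \ \sum_n b_n < \infty \ \text{a.s.},
\]
% V_n converges a.s. to a finite limit and \sum_n U_n < \infty a.s.
```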

major comments (1)
  1. Main convergence theorem: the claim that the rate is 'arbitrarily close to o(n^{1-2η})' depends on the choice of the Moreau-envelope smoothing parameter as a function of η and the Poisson-solution regularity constants. The manuscript must explicitly derive this parameter selection and show that it remains compatible with the contractivity assumption without introducing hidden dependencies that would alter the exponent.
minor comments (3)
  1. Abstract: the applicability to Q-learning and linear TD is stated but not illustrated with a concrete mapping of the contractive update and Poisson equation; adding one short example would immediately clarify the scope.
  2. Related-work section: prior results on Poisson corrections for Markovian SA (e.g., those using different Lyapunov constructions) are referenced but the precise technical advantage of the Moreau-envelope step over those approaches is not contrasted in a dedicated paragraph.
  3. Notation: the symbols for the Poisson solution (its boundedness and Lipschitz constants) and the Moreau envelope parameter appear in multiple places; a short table of symbols at the end of the preliminaries would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading, positive assessment of the significance of the Poisson-Moreau drift, and recommendation for minor revision. We address the major comment below.

read point-by-point responses
  1. Referee: Main convergence theorem: the claim that the rate is 'arbitrarily close to o(n^{1-2η})' depends on the choice of the Moreau-envelope smoothing parameter as a function of η and the Poisson-solution regularity constants. The manuscript must explicitly derive this parameter selection and show that it remains compatible with the contractivity assumption without introducing hidden dependencies that would alter the exponent.

    Authors: We agree that an explicit derivation of the smoothing parameter would improve clarity. In the current manuscript the choice of the Moreau-envelope parameter (denoted ε) is determined inside the proof of the main theorem so that the approximation error is controlled by the target rate while preserving a uniform contraction factor strictly less than one; however, this dependence on η and the Poisson-solution constants (Lipschitz modulus and bound M) is not isolated in a remark or statement preceding the theorem. In the revised version we will add an explicit derivation (as a short lemma or dedicated paragraph in the proof) that selects ε = ε(η, L, M) sufficiently small, independent of n, such that the smoothed mapping remains contractive with modulus α' < 1 that does not depend on n or ε in a way that changes the exponent. The resulting drift inequality then yields the claimed rate o(n^{1-2η}) (arbitrarily close) without hidden n-dependent factors altering the exponent, because all ε-induced error terms are absorbed into the o(·) notation for the chosen scaling. revision: yes
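
Two standard Moreau-envelope facts would do the work the rebuttal describes, stated here for a convex L-Lipschitz f under the Euclidean norm (the paper's generalized envelope may carry different constants):

```latex
% Uniform smoothing error, linear in \varepsilon, and gradient smoothness:
\[
  0 \;\le\; f(x) - M_\varepsilon f(x) \;\le\; \tfrac{\varepsilon L^2}{2},
  \qquad \nabla M_\varepsilon f \ \text{is } \tfrac{1}{\varepsilon}\text{-Lipschitz}.
\]
% Both hold for every fixed \varepsilon independent of n, which is why an
% \varepsilon = \varepsilon(\eta, L, M) fixed once cannot smuggle n-dependence
% into the contraction modulus or the claimed exponent.
```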

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The central Lyapunov drift is constructed by combining a standard Poisson-equation correction (under explicitly stated existence and regularity conditions for the Markovian noise) with Moreau-envelope smoothing (under the contractivity assumption on the mean-field operator). These are applied to obtain a supermartingale inequality that is then fed into classical Robbins-Siegmund or supermartingale convergence arguments. No equation reduces to a fitted parameter renamed as a prediction, no self-definitional loop appears, and no load-bearing uniqueness theorem or ansatz is imported solely via self-citation. The stated almost-sure rates are direct consequences of the drift inequality plus standard martingale tools; the derivation remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the contractivity of the expected update mapping and the existence of a Poisson equation solution for the Markovian noise; these are domain assumptions standard in the field but not independently verified here.

axioms (2)
  • domain assumption The expected update mapping is contractive
    Stated explicitly as the setting for the class of algorithms including Q-learning and linear TD.
  • domain assumption Markovian noise admits a solution to the Poisson equation
    Invoked for the correction term in the novel Lyapunov drift construction.
invented entities (1)
  • Poisson-Moreau drift · no independent evidence
    purpose: Novel Lyapunov function for establishing almost sure convergence rates
    Introduced as the key technical tool combining Poisson correction and Moreau-envelope smoothing.

pith-pipeline@v0.9.0 · 5501 in / 1367 out tokens · 36335 ms · 2026-05-11T01:19:38.000039+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
