Recognition: 2 theorem links
Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift
Pith reviewed 2026-05-11 01:19 UTC · model grok-4.3
The pith
A Poisson-Moreau drift establishes almost sure convergence rates for stochastic approximation under Markovian noise, approaching the optimal o(n^{-1}) for harmonic stepsizes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Lyapunov drift, built by applying a Poisson-equation-based correction for Markovian noise to a Moreau-envelope smoothing of the contractive mapping, yields almost sure convergence rates arbitrarily close to o(n^{1−2η}) for stepsizes of order n^{−η} with η ∈ (1/2, 1). For harmonic stepsizes of order n^{−1}, the rate is arbitrarily close to o(n^{−1}), which is near the law-of-the-iterated-logarithm bound.
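Read literally, "arbitrarily close to" means the rate holds after an arbitrarily small loss in the exponent. In symbols (our paraphrase; we write x_* for the fixed point and take the squared error as the functional, which the paper's exact statement may refine):

```latex
% Power-law regime:
\alpha_n = O(n^{-\eta}),\ \eta \in (\tfrac12, 1):
\qquad \|x_n - x_*\|^2 = o\big(n^{\,1 - 2\eta + \delta}\big) \ \text{a.s. for every } \delta > 0.
% Harmonic regime:
\alpha_n = O(n^{-1}):
\qquad \|x_n - x_*\|^2 = o\big(n^{-1 + \delta}\big) \ \text{a.s. for every } \delta > 0.
```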
What carries the argument
The Poisson-Moreau drift: a Lyapunov function that smooths the contractive mapping via its Moreau envelope and corrects the drift term using the solution to the Poisson equation for the Markov noise process.
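Two standard objects enter this construction. A minimal sketch of their textbook definitions (notation ours); the specific Lyapunov function built from them is the paper's contribution:

```latex
% Moreau envelope of a function f with smoothing parameter \lambda > 0:
f_\lambda(x) \;=\; \inf_{y} \Big\{ f(y) + \tfrac{1}{2\lambda}\,\|x - y\|^2 \Big\}
% Poisson equation for the noise chain (Y_n) with kernel P, stationary law \pi,
% and observable g; the corrector \hat g absorbs the Markovian correlation:
\hat g(y) - (P \hat g)(y) \;=\; g(y) - \pi(g)
```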
If this is right
- The rates apply to common RL algorithms like Q-learning and linear TD learning.
- For harmonic learning rates the almost sure rate is nearly optimal as per the law of the iterated logarithm.
- The analysis extends previous results that were limited to i.i.d. noise or weaker rates.
- Under power-law stepsizes, convergence holds at every rate strictly slower than the claimed exponent, i.e., o(n^{1−2η+δ}) for every δ > 0.
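For context on the benchmark these bullets invoke: schematically, in the scalar i.i.d. special case with asymptotic variance σ², the law of the iterated logarithm gives

```latex
\limsup_{n \to \infty}\; \frac{|x_n - x_*|}{\sqrt{2\sigma^2\, n^{-1} \log\log n}} \;=\; 1 \quad \text{a.s.},
% so the squared error reaches order n^{-1}\log\log n along a subsequence, and a
% rate arbitrarily close to o(n^{-1}) misses optimality only by the \log\log n factor.
```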
Where Pith is reading between the lines
- This drift construction might be adapted to other non-i.i.d. settings in optimization if a suitable Poisson solution exists.
- Practical RL implementations could use these rates to set learning rate schedules with theoretical backing for almost sure behavior.
- If the contractivity assumption holds in more general function approximation, the method could cover deeper RL algorithms.
Load-bearing premise
The expected updates must form a contractive mapping and the Markovian noise must admit a solution to the Poisson equation used in the drift correction.
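In symbols, a minimal sketch of the two premises (notation ours): contractivity pins down a unique target via the Banach fixed-point theorem, and under uniform ergodicity of the noise chain the Poisson equation has the standard series solution.

```latex
% (i) Contractivity of the expected update F:
\|F(x) - F(y)\| \;\le\; \gamma\,\|x - y\|,\quad 0 \le \gamma < 1
\;\Longrightarrow\; \exists!\; x_* \text{ with } F(x_*) = x_*.
% (ii) Solvability of the Poisson equation: under uniform ergodicity the series
\hat g \;=\; \sum_{k \ge 0} \big(P^k g - \pi(g)\big)
% converges and satisfies \hat g - P\hat g = g - \pi(g).
```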
What would settle it
A counterexample where a contractive stochastic approximation with Markov noise satisfying the Poisson equation fails to achieve the stated almost sure rate, or a direct computation on a simple case showing slower convergence.
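The second route is easy to probe numerically. A minimal sketch (ours, not from the paper): a scalar stochastic approximation whose expected update F(x) = γx + b is a γ-contraction, driven by noise from a two-state Markov chain, with power-law stepsize n^{-η}; the iterate should settle near the fixed point x_* = b/(1−γ) = 2.

```python
# Illustrative only (not the paper's setup): scalar contractive SA
#   x_{n+1} = x_n + a_n (F(x_n) - x_n + xi_n),  F(x) = gamma*x + b,
# with Markovian noise xi_n from a two-state chain on {+1, -1} and
# power-law stepsize a_n = n^{-eta}, eta in (1/2, 1).
import random

def run_sa(eta=0.75, gamma=0.5, b=1.0, steps=50_000, seed=0):
    rng = random.Random(seed)
    x = 0.0
    state = 1  # chain stays in its current state with probability 0.9
    for n in range(1, steps + 1):
        if rng.random() > 0.9:
            state = -state
        xi = 0.5 * state          # zero mean under the stationary law
        a_n = n ** (-eta)         # power-law learning rate O(n^{-eta})
        x += a_n * (gamma * x + b - x + xi)
    return x

x_star = 1.0 / (1.0 - 0.5)        # fixed point b/(1-gamma) = 2.0
print(abs(run_sa() - x_star))     # small residual error
```

With the fixed seed the deterministic bias is negligible after 50,000 steps and the residual is noise-dominated; sweeping η within (1/2, 1) changes the decay in line with the claimed exponent, though a careful rate check would require averaging over runs.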
Original abstract
Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are contractive, a setting that arises in many reinforcement learning algorithms such as $Q$-learning and linear temporal difference learning. Specifically, for a power-law learning rate $O(n^{-\eta})$ with $\eta \in (1/2, 1)$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{1 - 2\eta})$. For a harmonic learning rate $O(n^{-1})$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{-1})$, which we argue is a strong result because it is close to the optimal rate $O(n^{-1}\log\log n)$ given by the law of the iterated logarithm (for a special case of i.i.d. noise). Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation based correction for Markovian noise to the well-established Moreau-envelope smoothing for the contractive mapping.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a new Lyapunov drift technique called the Poisson-Moreau drift for analyzing almost sure convergence rates of stochastic approximation (SA) algorithms with contractive mean updates under Markovian noise. This setting is relevant to reinforcement learning methods such as Q-learning and linear temporal difference learning. For learning rates of the form O(n^{-η}) where η ∈ (1/2, 1), the paper derives an almost sure convergence rate that can be made arbitrarily close to o(n^{1-2η}). For the harmonic learning rate O(n^{-1}), the rate is arbitrarily close to o(n^{-1}), argued to be strong as it approaches the optimal rate from the law of the iterated logarithm in the i.i.d. noise case. The key innovation is applying a Poisson-equation based correction for the Markovian noise to the Moreau-envelope smoothed contractive mapping, leading to a drift inequality amenable to supermartingale arguments.
Significance. This work is significant because establishing almost sure rates for SA under Markovian noise has been challenging, and the results here are close to optimal. The Poisson-Moreau construction provides a systematic way to handle both the noise correlation and the contractivity, potentially applicable to other algorithms. Strengths include the explicit conditions for the Poisson solution existence and regularity, and the use of standard tools like Robbins-Siegmund lemma after the drift construction. If verified, it advances the theory for RL convergence analysis.
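The Robbins-Siegmund lemma invoked here is standard; for reference, the form typically applied after a drift inequality:

```latex
% Robbins-Siegmund (1971): let V_n \ge 0 be adapted to a filtration (\mathcal F_n) with
E\big[V_{n+1} \mid \mathcal F_n\big] \;\le\; (1 + a_n)\,V_n \;-\; b_n \;+\; c_n,
\qquad a_n, b_n, c_n \ge 0,\quad \textstyle\sum_n a_n < \infty,\ \sum_n c_n < \infty \ \text{a.s.}
% Then V_n converges a.s. to a finite limit and \sum_n b_n < \infty a.s.
```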
Major comments (1)
- Main convergence theorem: the claim that the rate is 'arbitrarily close to o(n^{1-2η})' depends on the choice of the Moreau-envelope smoothing parameter as a function of η and the Poisson-solution regularity constants. The manuscript must explicitly derive this parameter selection and show that it remains compatible with the contractivity assumption without introducing hidden dependencies that would alter the exponent.
Minor comments (3)
- Abstract: the applicability to Q-learning and linear TD is stated but not illustrated with a concrete mapping of the contractive update and Poisson equation; adding one short example would immediately clarify the scope.
- Related-work section: prior results on Poisson corrections for Markovian SA (e.g., those using different Lyapunov constructions) are referenced but the precise technical advantage of the Moreau-envelope step over those approaches is not contrasted in a dedicated paragraph.
- Notation: the symbols for the Poisson solution (its boundedness and Lipschitz constants) and the Moreau envelope parameter appear in multiple places; a short table of symbols at the end of the preliminaries would improve readability.
Simulated Author's Rebuttal
We thank the referee for their careful reading, positive assessment of the significance of the Poisson-Moreau drift, and recommendation for minor revision. We address the major comment below.
Point-by-point responses
- Referee: Main convergence theorem: the claim that the rate is 'arbitrarily close to o(n^{1-2η})' depends on the choice of the Moreau-envelope smoothing parameter as a function of η and the Poisson-solution regularity constants. The manuscript must explicitly derive this parameter selection and show that it remains compatible with the contractivity assumption without introducing hidden dependencies that would alter the exponent.
Authors: We agree that an explicit derivation of the smoothing parameter would improve clarity. In the current manuscript, the Moreau-envelope parameter (denoted ε) is chosen inside the proof of the main theorem so that the approximation error is controlled by the target rate while a uniform contraction factor strictly less than one is preserved; however, this dependence on η and the Poisson-solution constants (Lipschitz modulus L and bound M) is not isolated in a remark or statement preceding the theorem. In the revised version we will add an explicit derivation, as a short lemma or dedicated paragraph in the proof, that selects ε = ε(η, L, M) sufficiently small and independent of n, such that the smoothed mapping remains contractive with a modulus α' < 1 that does not depend on n or ε in a way that changes the exponent. The resulting drift inequality then yields a rate arbitrarily close to the claimed o(n^{1−2η}) without hidden n-dependent factors, because all ε-induced error terms are absorbed into the o(·) notation for the chosen scaling.
Revision: yes
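Schematically, the promised selection can be illustrated as follows (our illustration with a hypothetical perturbation constant C(L, M), not the manuscript's actual bound): if smoothing perturbs the contraction modulus γ by at most C(L, M)·ε, then

```latex
\varepsilon(\eta, L, M) \;<\; \frac{1 - \gamma}{2\,C(L, M)}
\quad\Longrightarrow\quad
\alpha' \;\le\; \gamma + C(L, M)\,\varepsilon \;<\; \frac{1 + \gamma}{2} \;<\; 1,
% a modulus independent of n, so no hidden factor alters the exponent in o(n^{1-2\eta}).
```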
Circularity Check
No significant circularity; derivation self-contained
full rationale
The central Lyapunov drift is constructed by combining a standard Poisson-equation correction (under explicitly stated existence and regularity conditions for the Markovian noise) with Moreau-envelope smoothing (under the contractivity assumption on the mean-field operator). These are applied to obtain a supermartingale inequality that is then fed into classical Robbins-Siegmund or supermartingale convergence arguments. No equation reduces to a fitted parameter renamed as a prediction, no self-definitional loop appears, and no load-bearing uniqueness theorem or ansatz is imported solely via self-citation. The stated almost-sure rates are direct consequences of the drift inequality plus standard martingale tools; the derivation remains independent of its own outputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the expected update mapping is contractive.
- Domain assumption: the Markovian noise admits a solution to the Poisson equation used in the drift correction.
Invented entities (1)
- Poisson-Moreau drift (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Lemma 4. ... E_n[V^ξ_{n+1}] ≤ (1 − μ_ξ α_n + C_{ξ,K} r_n) V^ξ_n + C_{ξ,K} r_n a.s.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation based correction for Markovian noise to the well-established Moreau-envelope smoothing"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.