pith. machine review for the scientific record.

arxiv: 2605.07104 · v1 · submitted 2026-05-08 · 💻 cs.LG · math.OC · stat.ML

Recognition: 2 theorem links · Lean Theorem

Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 01:19 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords stochastic approximation · reinforcement learning · almost sure convergence · Markovian noise · Poisson equation · Moreau envelope · Lyapunov drift · Q-learning

The pith

A Poisson-Moreau drift establishes almost sure convergence rates for stochastic approximation under Markovian noise, approaching the near-optimal o(n^{-1}) for harmonic stepsizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a new analysis technique for stochastic approximation algorithms whose expected updates are contractive, a setting that covers many reinforcement learning methods such as Q-learning. By combining Moreau-envelope smoothing with a Poisson-equation correction that handles Markovian noise, the authors derive almost sure convergence rates arbitrarily close to o(n^{1-2η}) for power-law learning rates with exponent η between 1/2 and 1, and arbitrarily close to o(n^{-1}) for the harmonic rate. This matters because establishing almost sure rates under dependent noise has been difficult, the new rates come close to what is known to be optimal in the independent case, and the guarantees apply directly to practical algorithms running in sequential decision environments where samples are correlated.
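
For orientation, the recursion class in question can be written schematically as follows (a sketch in our notation; the paper's exact assumptions on norms and noise are richer):

```latex
% Contractive stochastic approximation driven by Markovian noise (schematic).
\[
  x_{n+1} \;=\; x_n + \alpha_n \bigl( F(x_n, Y_{n+1}) - x_n \bigr),
  \qquad \alpha_n = \Theta(n^{-\eta}),
\]
% where (Y_n) is an ergodic Markov chain with stationary law \pi, and the
% averaged operator \bar{F}(x) = E_{Y \sim \pi}[F(x, Y)] is a contraction:
\[
  \|\bar{F}(x) - \bar{F}(x')\| \;\le\; \gamma\,\|x - x'\|, \qquad \gamma < 1,
\]
% so \bar{F} has a unique fixed point x^* that the iterates should approach.
```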

Core claim

The central claim is that a Lyapunov drift constructed by applying a Poisson-equation-based correction for Markovian noise to the Moreau-envelope smoothing of the contractive mapping yields almost sure convergence rates arbitrarily close to o(n^{1-2η}) for stepsizes of order n^{-η} with η ∈ (1/2, 1), and arbitrarily close to o(n^{-1}) for harmonic stepsizes of order n^{-1}, the latter being near the law of the iterated logarithm bound.
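
Unpacked, "arbitrarily close to" quantifies over a slack parameter δ; schematically (with e_n the error in the paper's chosen metric, whether norm or squared norm being the paper's convention to fix):

```latex
\[
  \forall\,\delta > 0:\quad
  e_n \;=\; o\!\bigl(n^{\,1-2\eta+\delta}\bigr)\ \text{a.s.},
  \qquad \alpha_n = \Theta(n^{-\eta}),\ \eta \in (\tfrac12, 1),
\]
\[
  \forall\,\delta > 0:\quad
  e_n \;=\; o\!\bigl(n^{-1+\delta}\bigr)\ \text{a.s.},
  \qquad \alpha_n = \Theta(n^{-1}),
\]
% the harmonic case sitting just above the law-of-the-iterated-logarithm
% benchmark O(n^{-1} \log\log n) known for i.i.d. noise.
```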

What carries the argument

The Poisson-Moreau drift: a Lyapunov function that smooths the contractive mapping via its Moreau envelope and corrects the drift term using the solution to the Poisson equation for the Markov noise process.
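
The two ingredients are standard individually; schematically (our symbols, and the paper's generalized envelope may use non-Euclidean norms):

```latex
% Moreau envelope of a convex function f, smoothing parameter \varepsilon > 0:
\[
  M_\varepsilon f(x) \;=\; \min_{u} \Bigl\{ f(u) + \tfrac{1}{2\varepsilon}\|x - u\|^2 \Bigr\},
\]
% Poisson equation for the noise chain with transition kernel P, solved by v:
\[
  v(y) - (P v)(y) \;=\; g(y) - \pi(g).
\]
% The drift couples the two: the Lyapunov function is the Moreau-smoothed
% distance to the fixed point, corrected by v so the Markovian fluctuation
% telescopes into a martingale-difference term plus summable remainders.
```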

If this is right

  • The rates apply to common RL algorithms such as Q-learning and linear TD learning (a minimal Q-learning sketch follows this list).
  • For harmonic learning rates the almost sure rate is nearly optimal by the law of the iterated logarithm.
  • The analysis extends previous results that were limited to i.i.d. noise or to weaker rates.
  • Power-law learning rates converge faster than n^{1-2η+δ} for every δ > 0, that is, faster than any power strictly weaker than the claimed exponent.
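
A minimal tabular Q-learning loop of the kind those rates are claimed to cover; the toy MDP, uniform exploration, and global-count stepsize are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: 2 states, 2 actions, random transitions and rewards (illustrative).
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a dist over s'
R = rng.standard_normal((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
eta = 0.75  # power-law stepsize exponent in (1/2, 1)
s = 0
for n in range(1, 100_001):
    a = rng.integers(n_actions)               # behavior policy: uniform exploration
    s_next = rng.choice(n_states, p=P[s, a])  # Markovian sample: s' depends on s
    alpha = n ** (-eta)
    # Asynchronous Q-learning update; its expected update is the Bellman
    # optimality operator, a gamma-contraction in the sup norm, which is
    # exactly the contractivity assumption in the paper's setting.
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```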

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This drift construction might be adapted to other non-i.i.d. settings in optimization if a suitable Poisson solution exists.
  • Practical RL implementations could use these rates to set learning rate schedules with theoretical backing for almost sure behavior.
  • If the contractivity assumption holds in more general function approximation, the method could cover deeper RL algorithms.

Load-bearing premise

The expected updates must form a contractive mapping and the Markovian noise must admit a solution to the Poisson equation used in the drift correction.
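
For context, a standard sufficient condition for the second premise (a gloss, not something verified here for any specific chain): when the noise chain is uniformly ergodic and g is bounded, the fundamental solution below converges geometrically and solves the Poisson equation, supplying the boundedness and regularity constants the correction needs.

```latex
% Fundamental solution of the Poisson equation under uniform ergodicity:
\[
  v(y) \;=\; \sum_{k=0}^{\infty} \bigl( (P^k g)(y) - \pi(g) \bigr),
  \qquad v - P v \;=\; g - \pi(g).
\]
```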

What would settle it

A counterexample where a contractive stochastic approximation with Markov noise satisfying the Poisson equation fails to achieve the stated almost sure rate, or a direct computation on a simple case showing slower convergence.
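
The second test is cheap to run; a sketch that estimates the empirical decay exponent of a scalar contractive SA driven by a two-state Markov chain (entirely illustrative: the chain, operator, and crude log-log fit are our choices, and a single-trajectory fit only proxies the almost sure envelope):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-state Markov noise chain; stationary law pi = (2/3, 1/3) satisfies pi @ P == pi.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
b = np.array([1.0, -2.0])  # chosen so the stationary mean of b[y] is zero

# Noisy operator whose stationary average is the contraction F_bar(x) = 0.5 x,
# with fixed point x* = 0.
F = lambda x, y: 0.5 * x + b[y]

eta, N = 0.75, 10**6
x, y = 5.0, 0
errs = np.empty(N)
for n in range(1, N + 1):
    y = rng.choice(2, p=P[y])
    x += n ** (-eta) * (F(x, y) - x)
    errs[n - 1] = x * x

# Fit the tail decay of the squared error; the claimed a.s. rate predicts a
# slope at or below 1 - 2*eta = -0.5 (up to the delta slack and noise).
tail = np.arange(N // 10, N)
slope = np.polyfit(np.log(tail + 1.0), np.log(errs[tail] + 1e-300), 1)[0]
print(f"empirical exponent ~ {slope:.2f}, target 1 - 2*eta = {1 - 2*eta:.2f}")
```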

read the original abstract

Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are contractive, a setting that arises in many reinforcement learning algorithms such as $Q$-learning and linear temporal difference learning. Specifically, for a power-law learning rate $O(n^{-\eta})$ with $\eta \in (1/2, 1)$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{1 - 2\eta})$. For a harmonic learning rate $O(n^{-1})$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{-1})$, which we argue is a strong result because it is close to the optimal rate $O(n^{-1}\log\log n)$ given by the law of the iterated logarithm (for a special case of i.i.d. noise). Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation based correction for Markovian noise to the well-established Moreau-envelope smoothing for the contractive mapping.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript proposes a new Lyapunov drift technique called the Poisson-Moreau drift for analyzing almost sure convergence rates of stochastic approximation (SA) algorithms with contractive mean updates under Markovian noise. This setting is relevant to reinforcement learning methods such as Q-learning and linear temporal difference learning. For learning rates of the form O(n^{-η}) where η ∈ (1/2, 1), the paper derives an almost sure convergence rate that can be made arbitrarily close to o(n^{1-2η}). For the harmonic learning rate O(n^{-1}), the rate is arbitrarily close to o(n^{-1}), argued to be strong as it approaches the optimal rate from the law of the iterated logarithm in the i.i.d. noise case. The key innovation is applying a Poisson-equation based correction for the Markovian noise to the Moreau-envelope smoothed contractive mapping, leading to a drift inequality amenable to supermartingale arguments.

Significance. This work is significant because establishing almost sure rates for SA under Markovian noise has been challenging, and the results here are close to optimal. The Poisson-Moreau construction provides a systematic way to handle both the noise correlation and the contractivity, and is potentially applicable to other algorithms. Strengths include the explicit conditions for the existence and regularity of the Poisson solution, and the reduction to standard tools such as the Robbins-Siegmund lemma once the drift is constructed. If verified, it advances the theory of RL convergence analysis.
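
For reference, the drift-to-rate step leans on a classical result; the statement below is the standard Robbins-Siegmund lemma (the paper's exact variant may differ in bookkeeping):

```latex
% Robbins-Siegmund (1971): for nonnegative (F_n)-adapted V_n, U_n, a_n, b_n with
\[
  \mathbb{E}\bigl[\,V_{n+1} \mid \mathcal{F}_n\,\bigr]
  \;\le\; (1 + a_n)\,V_n \;-\; U_n \;+\; b_n,
  \qquad \textstyle\sum_n a_n < \infty,\ \ \sum_n b_n < \infty \ \text{a.s.},
\]
% V_n converges a.s. to a finite limit and \sum_n U_n < \infty a.s.
```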

major comments (1)
  1. Main convergence theorem: the claim that the rate is 'arbitrarily close to o(n^{1-2η})' depends on the choice of the Moreau-envelope smoothing parameter as a function of η and the Poisson-solution regularity constants. The manuscript must explicitly derive this parameter selection and show that it remains compatible with the contractivity assumption without introducing hidden dependencies that would alter the exponent.
minor comments (3)
  1. Abstract: the applicability to Q-learning and linear TD is stated but not illustrated with a concrete mapping of the contractive update and Poisson equation; adding one short example would immediately clarify the scope.
  2. Related-work section: prior results on Poisson corrections for Markovian SA (e.g., those using different Lyapunov constructions) are referenced but the precise technical advantage of the Moreau-envelope step over those approaches is not contrasted in a dedicated paragraph.
  3. Notation: the symbols for the Poisson solution (its boundedness and Lipschitz constants) and the Moreau envelope parameter appear in multiple places; a short table of symbols at the end of the preliminaries would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading, positive assessment of the significance of the Poisson-Moreau drift, and recommendation for minor revision. We address the major comment below.

read point-by-point responses
  1. Referee: Main convergence theorem: the claim that the rate is 'arbitrarily close to o(n^{1-2η})' depends on the choice of the Moreau-envelope smoothing parameter as a function of η and the Poisson-solution regularity constants. The manuscript must explicitly derive this parameter selection and show that it remains compatible with the contractivity assumption without introducing hidden dependencies that would alter the exponent.

    Authors: We agree that an explicit derivation of the smoothing parameter would improve clarity. In the current manuscript the choice of the Moreau-envelope parameter (denoted ε) is determined inside the proof of the main theorem so that the approximation error is controlled by the target rate while preserving a uniform contraction factor strictly less than one; however, this dependence on η and the Poisson-solution constants (Lipschitz modulus and bound M) is not isolated in a remark or statement preceding the theorem. In the revised version we will add an explicit derivation (as a short lemma or dedicated paragraph in the proof) that selects ε = ε(η, L, M) sufficiently small, independent of n, such that the smoothed mapping remains contractive with modulus α' < 1 that does not depend on n or ε in a way that changes the exponent. The resulting drift inequality then yields the claimed rate o(n^{1-2η}) (arbitrarily close) without hidden n-dependent factors altering the exponent, because all ε-induced error terms are absorbed into the o(·) notation for the chosen scaling. revision: yes
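
Two standard Moreau-envelope facts would do the work the rebuttal describes, stated here for a convex L-Lipschitz f under the Euclidean norm (the paper's generalized envelope may carry different constants):

```latex
% Uniform smoothing error, linear in \varepsilon, and gradient smoothness:
\[
  0 \;\le\; f(x) - M_\varepsilon f(x) \;\le\; \tfrac{\varepsilon L^2}{2},
  \qquad \nabla M_\varepsilon f \ \text{is } \tfrac{1}{\varepsilon}\text{-Lipschitz}.
\]
% Both hold for every fixed \varepsilon independent of n, which is why an
% \varepsilon = \varepsilon(\eta, L, M) fixed once cannot smuggle n-dependence
% into the contraction modulus or the claimed exponent.
```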

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The central Lyapunov drift is constructed by combining a standard Poisson-equation correction (under explicitly stated existence and regularity conditions for the Markovian noise) with Moreau-envelope smoothing (under the contractivity assumption on the mean-field operator). These are applied to obtain a supermartingale inequality that is then fed into classical Robbins-Siegmund or supermartingale convergence arguments. No equation reduces to a fitted parameter renamed as a prediction, no self-definitional loop appears, and no load-bearing uniqueness theorem or ansatz is imported solely via self-citation. The stated almost-sure rates are direct consequences of the drift inequality plus standard martingale tools; the derivation remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the contractivity of the expected update mapping and the existence of a Poisson equation solution for the Markovian noise; these are domain assumptions standard in the field but not independently verified here.

axioms (2)
  • domain assumption The expected update mapping is contractive
    Stated explicitly as the setting for the class of algorithms including Q-learning and linear TD.
  • domain assumption Markovian noise admits a solution to the Poisson equation
    Invoked for the correction term in the novel Lyapunov drift construction.
invented entities (1)
  • Poisson-Moreau drift · no independent evidence
    purpose: Novel Lyapunov function for establishing almost sure convergence rates
    Introduced as the key technical tool combining Poisson correction and Moreau-envelope smoothing.

pith-pipeline@v0.9.0 · 5501 in / 1367 out tokens · 36335 ms · 2026-05-11T01:19:38.000039+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
