Toward a Unified Lyapunov-Certified ODE Convergence Analysis of Smooth Q-Learning with p-Norms

Donghwan Lee; Hyunjun Na

arxiv: 2404.14442 · v7 · submitted 2024-04-20 · 💻 cs.LG · cs.AI

Toward a Unified Lyapunov-Certified ODE Convergence Analysis of Smooth Q-Learning with p-Norms

Donghwan Lee , Hyunjun Na This is my paper

Pith reviewed 2026-05-24 02:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords Q-learningODE methodLyapunov functionp-normsoft Q-learningBellman operatorconvergence analysisreinforcement learning

0 comments

The pith

A smooth p-norm Lyapunov function unifies ODE convergence analysis for Q-learning and its smoothed variants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified ODE-based convergence framework that covers standard Q-learning together with smoothed versions built on log-sum-exponential softmax, Boltzmann softmax, and mellowmax operators. It relies on a smooth p-norm Lyapunov function to obtain concise stability certificates for the associated continuous-time ODEs. This method sidesteps the non-smoothness difficulties of classical infinity-norm arguments and continues to work when the Bellman operator fails to be a contraction. A sympathetic reader would see the result as a single off-the-shelf tool that can certify convergence across a wider family of Q-learning algorithms than previously available.

Core claim

The central claim is that a smooth p-norm Lyapunov function yields rigorous stability arguments for the ODE systems that arise from the log-sum-exponential softmax, Boltzmann softmax, and mellowmax operators, while also recovering the standard Q-learning case. The same construction remains valid in regimes where the Bellman operator is not contractive, such as Boltzmann soft Q-learning.

What carries the argument

The smooth p-norm Lyapunov function, which certifies global asymptotic stability of the equilibrium for each of the listed ODEs.

If this is right

The same analysis applies directly to classical Q-learning.
It covers log-sum-exponential softmax Q-learning.
It covers Boltzmann softmax Q-learning.
It covers mellowmax Q-learning.
Stability holds without requiring the Bellman operator to be a contraction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The construction may serve as a template for analyzing other families of smoothed Bellman operators not treated in the paper.
Numerical integration of the ODEs could be used to predict the transient speed of the corresponding discrete algorithms.
The non-contraction setting raises the possibility that convergence rate depends on the choice of p rather than on contraction modulus.
Extensions to stochastic approximation or asynchronous updates would follow the same Lyapunov template once the ODE limit is established.

Load-bearing premise

The chosen p-norm Lyapunov function decreases along every trajectory of the ODE for each operator under consideration.

What would settle it

An explicit state trajectory of one of the ODEs along which the time derivative of the p-norm Lyapunov function is positive would falsify the stability claim.

Figures

Figures reproduced from arXiv: 2404.14442 by Donghwan Lee, Hyunjun Na.

**Figure 2.** Figure 2: Empirical stability analysis of the smooth Q-learning with Bolt [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗

**Figure 3.** Figure 3: Empirical stability analysis under different softmax operat [PITH_FULL_IMAGE:figures/full_fig_p033_3.png] view at source ↗

**Figure 4.** Figure 4: Empirical stability analysis under different softmax operat [PITH_FULL_IMAGE:figures/full_fig_p034_4.png] view at source ↗

read the original abstract

Convergence of Q-learning has been the subject of extensive study for decades. Among the available techniques, the ordinary differential equation (ODE) method is particularly appealing as a general-purpose, off-the-shelf tool for sanity-checking the convergence of a wide range of reinforcement learning algorithms. In this paper, we develop a unified ODE-based convergence framework that applies to standard Q-learning and several soft/smoothed variants, including those built on the log-sum-exponential softmax, Boltzmann softmax, and mellowmax operators. Our analysis uses a smooth p-norm Lyapunov function, leading to concise yet rigorous stability arguments and circumventing the non-smoothness issues inherent to classical infty-norm-based approaches. To the best of our knowledge, the proposed framework is among the first to provide a unified ODE-based treatment that is broadly applicable to smooth Q-learning algorithms while also encompassing standard Q-learning. Moreover, it remains valid even in settings where the associated Bellman operator is not a contraction, as may happen in Boltzmann soft Q-learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a p-norm Lyapunov function for a unified ODE analysis of Q-learning and its soft variants, with a claim that it works even without contraction.

read the letter

The main takeaway is that this paper develops a unified ODE convergence framework for standard Q-learning plus smoothed versions based on log-sum-exp, Boltzmann, and mellowmax operators. It uses a smooth p-norm Lyapunov function to get stability arguments and says the approach holds even when the Bellman operator is not a contraction. That non-contraction coverage is the part worth noting first. The work tries to replace the usual infinity-norm analysis with something smoother to keep the arguments concise. On the positive side, pulling multiple operators into one framework could save time for people who want a general check on convergence across these algorithms. The abstract positions the p-norm choice as a practical fix for non-smoothness issues that show up in classical treatments. The soft spots are in the details that are not visible from the abstract alone. It is not clear yet whether the p-norm actually delivers rigorous stability in the non-contraction regime without extra assumptions or case-by-case tweaks, and the literature comparison needs checking to confirm how much is genuinely new versus an extension of existing ODE work. The proofs will have to stand on their own. This paper is aimed at researchers who do theoretical analysis of RL algorithms and already use ODE or Lyapunov methods. A reader working on soft Q-learning or convergence verification might get something out of the framework if the math checks. It deserves a serious referee because the topic sits inside an active subfield and the stated goal is concrete enough to merit expert review on the derivations.

Referee Report

0 major / 2 minor

Summary. The paper develops a unified ODE-based convergence framework applicable to standard Q-learning as well as smoothed variants based on log-sum-exponential softmax, Boltzmann softmax, and mellowmax operators. It employs a smooth p-norm Lyapunov function to derive concise rigorous stability arguments for the associated ODE systems, with the analysis claimed to remain valid even when the Bellman operator is not a contraction.

Significance. If the central claims hold, the work supplies a general-purpose Lyapunov-certified ODE tool that handles both contractive and certain non-contractive cases while avoiding non-smoothness difficulties of classical infinity-norm arguments; this could streamline convergence proofs across a family of Q-learning algorithms and is a potentially useful addition to the RL theory literature.

minor comments (2)

The abstract states that the framework 'remains valid even in settings where the associated Bellman operator is not a contraction'; the introduction or §2 should explicitly identify the precise conditions on the operator under which the p-norm Lyapunov argument continues to apply, with a short counter-example or reference to the Boltzmann case.
Notation for the p-norm Lyapunov function and the associated ODE should be introduced with a single consolidated definition (rather than scattered across sections) to improve readability for readers unfamiliar with the smoothed operators.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive evaluation of the manuscript. We are pleased that the unified ODE convergence framework using smooth p-norm Lyapunov functions is viewed as a potentially useful addition to the RL theory literature, particularly for its applicability to both contractive and certain non-contractive cases. Since the report contains no specific major comments requiring clarification or revision, we provide no point-by-point responses below.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The provided abstract and context describe a theoretical derivation of a unified ODE convergence framework for Q-learning variants via a smooth p-norm Lyapunov function. No load-bearing steps are visible that reduce claims to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The stability arguments are presented as independent mathematical constructions that apply even without contractivity, with no quoted equations or citations indicating that any result is equivalent to its inputs by construction. This is the expected outcome for a self-contained theoretical analysis paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; ledger is empty by default.

pith-pipeline@v0.9.0 · 5704 in / 1056 out tokens · 28134 ms · 2026-05-24T02:14:21.053097+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contraction-Aligned Analysis of Soft Bellman Residual Minimization with Weighted Lp-Norm for Markov Decision Problem
cs.LG 2026-04 unverdicted novelty 6.0

Soft Bellman residual minimization with weighted Lp-norm aligns the objective with Bellman contraction as p increases and yields performance error bounds.
Safe-Support Q-Learning: Learning without Unsafe Exploration
cs.LG 2026-04 unverdicted novelty 5.0

Safe-Support Q-Learning trains Q-functions and policies in reinforcement learning without ever visiting unsafe states by constraining the behavior policy to a safe set and using KL-regularized Bellman targets in a two...

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction . MIT Press, 1998

work page 1998
[2]

Human-level control through deep rein- forcement learning,

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. B ellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al. , “Human-level control through deep rein- forcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015

work page 2015
[3]

Q-learning,

C. J. Watkins and P. Dayan, “Q-learning,” Machine learning , vol. 8, no. 3-4, pp. 279–292, 1992. 12

work page 1992
[4]

Convergence of st ochastic iterative dynamic programming algorithms,

T. Jaakkola, M. I. Jordan, and S. P. Singh, “Convergence of st ochastic iterative dynamic programming algorithms,” in Advances in neural information processing systems , 1994, pp. 703–710

work page 1994
[5]

The ODE method for convergence of stochastic approximation and reinforcement learning,

V. S. Borkar and S. P. Meyn, “The ODE method for convergence of stochastic approximation and reinforcement learning,” SIAM Journal on Control and Optimization , vol. 38, no. 2, pp. 447–469, 2000

work page 2000
[6]

A uniﬁed switching system perspective and con vergence analysis of Q- learning algorithms,

D. Lee and N. He, “A uniﬁed switching system perspective and con vergence analysis of Q- learning algorithms,” in 34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020

work page 2020
[7]

Asynchronous stochastic approximation and Q- learning,

J. N. Tsitsiklis, “Asynchronous stochastic approximation and Q- learning,” Machine learning, vol. 16, no. 3, pp. 185–202, 1994

work page 1994
[8]

Conv ergence results for single-step on-policy reinforcement-learning algorithms,

S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesv´ ari, “Conv ergence results for single-step on-policy reinforcement-learning algorithms,” Machine learning , vol. 38, pp. 287–308, 2000

work page 2000
[9]

The asymptotic convergence-rate of Q-lear ning,

C. Szepesv´ ari, “The asymptotic convergence-rate of Q-lear ning,” in Advances in Neural In- formation Processing Systems , 1998, pp. 1064–1070

work page 1998
[10]

Learning rates for Q-learning,

E. Even-Dar and Y. Mansour, “Learning rates for Q-learning,” Journal of machine learning Research, vol. 5, no. Dec, pp. 1–25, 2003

work page 2003
[11]

Error bounds for constant step- size Q-learning,

C. L. Beck and R. Srikant, “Error bounds for constant step- size Q-learning,” Systems & Control letters , vol. 61, no. 12, pp. 1203–1208, 2012

work page 2012
[12]

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

M. J. Wainwright, “Stochastic approximation with cone-contra ctive operators: Sharp ℓ∞ - bounds for Q-learning,” arXiv preprint arXiv:1905.06265 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[13]

Finite-time analysis of asynchronous sto chastic approximation and Q-learning,

G. Qu and A. Wierman, “Finite-time analysis of asynchronous sto chastic approximation and Q-learning,” arXiv preprint arXiv:2002.00260 , 2020

work page arXiv 2002
[14]

Sample complexity of asy nchronous Q-learning: Sharper analysis and variance reduction,

G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen, “Sample complexity of asy nchronous Q-learning: Sharper analysis and variance reduction,” arXiv preprint arXiv:2006.03041 , 2020

work page arXiv 2006
[15]

A L yapunov theory for ﬁnite- sample guarantees of asynchronous Q-learning and TD-learning va riants,

Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam, “A L yapunov theory for ﬁnite- sample guarantees of asynchronous Q-learning and TD-learning va riants,” arXiv preprint arXiv:2102.01567, 2021

work page arXiv 2021
[16]

Finite-sample analysis of contractive stochastic approxim ation using smooth convex envelopes,

——, “Finite-sample analysis of contractive stochastic approxim ation using smooth convex envelopes,” Advances in Neural Information Processing Systems , vol. 33, pp. 8223–8234, 2020

work page 2020
[17]

Final iteration convergence of Q-learning: Switching s ystem approach,

D. Lee, “Final iteration convergence of Q-learning: Switching s ystem approach,” IEEE Trans- actions on Automatic Control , 2024

work page 2024
[18]

A convergento(n) temporal-diﬀerence algorithm for oﬀ-policy learning with linear function approximation,

R. S. Sutton, H. R. Maei, and C. Szepesv´ ari, “A convergento(n) temporal-diﬀerence algorithm for oﬀ-policy learning with linear function approximation,” in Advances in neural information processing systems, 2009, pp. 1609–1616. 13

work page 2009
[19]

Fast gradient-descent methods for temporal-diﬀerence learnin g with linear function approx- imation,

R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C . Szepesv´ ari, and E. Wiewiora, “Fast gradient-descent methods for temporal-diﬀerence learnin g with linear function approx- imation,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 993–1000

work page 2009
[20]

Gradient temporal- diﬀerence learning with regularized corrections,

S. Ghiassian, A. Patterson, S. Garg, D. Gupta, A. White, and M . White, “Gradient temporal- diﬀerence learning with regularized corrections,” in International Conference on Machine Learning, 2020, pp. 3524–3534

work page 2020
[21]

New versions of gradien t temporal diﬀerence learning,

D. Lee, H.-D. Lim, J. Park, and O. Choi, “New versions of gradien t temporal diﬀerence learning,” IEEE Transactions on Automatic Control , 2022

work page 2022
[22]

An analysis of reinforc ement learning with function approximation,

F. S. Melo, S. P. Meyn, and M. I. Ribeiro, “An analysis of reinforc ement learning with function approximation,” in Proceedings of the 25th international conference on Machin e learning , 2008, pp. 664–671

work page 2008
[23]

Bhatnagar, H

S. Bhatnagar, H. Prasad, and L. Prashanth, Stochastic recursive algorithms for optimization: simultaneous perturbation methods . Springer, 2012, vol. 434

work page 2012
[24]

Reinforceme nt learning with deep energy- based policies,

T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforceme nt learning with deep energy- based policies,” in International conference on machine learning , 2017, pp. 1352–1361

work page 2017
[25]

Revisiting the softmax Bellman o perator: New beneﬁts and new perspective,

Z. Song, R. Parr, and L. Carin, “Revisiting the softmax Bellman o perator: New beneﬁts and new perspective,” in International conference on machine learning , 2019, pp. 5916–5925

work page 2019
[26]

Reinforcemen t learning with dynamic Boltzmann softmax updates,

L. Pan, Q. Cai, Q. Meng, W. Chen, and L. Huang, “Reinforcemen t learning with dynamic Boltzmann softmax updates,” in International Joint Conference on Artiﬁcial Intelligence , 2020, pp. 1992–1998

work page 2020
[27]

An alternative softmax operator for reinforcement learning,

K. Asadi and M. L. Littman, “An alternative softmax operator for reinforcement learning,” in International Conference on Machine Learning , 2017, pp. 243–252

work page 2017
[28]

Smoothed Q-learning,

D. Barber, “Smoothed Q-learning,” arXiv preprint arXiv:2303.08631 , 2023

work page arXiv 2023
[29]

An analog scheme for ﬁxed point computation. i. theory,

V. S. Borkar and K. Soumyanatha, “An analog scheme for ﬁxed point computation. i. theory,” IEEE Transactions on Circuits and Systems I: Fundamental Th eory and Applications , vol. 44, no. 4, pp. 351–355, 1997

work page 1997
[30]

Liberzon, Switching in systems and control

D. Liberzon, Switching in systems and control . Springer Science & Business Media, 2003

work page 2003
[31]

Uniﬁed ﬁnite-time error analysis of soft q -learning,

N. Jeong and D. Lee, “Uniﬁed ﬁnite-time error analysis of soft q -learning,” Neurocomputing, vol. 626, p. 129582, 2025

work page 2025
[32]

SBEED: Convergent reinforcement learning with nonlinear function approximation,

B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song , “SBEED: Convergent reinforcement learning with nonlinear function approximation,” in International conference on machine learning , 2018, pp. 1125–1134

work page 2018
[33]

On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

B. Gao and L. Pavel, “On the properties of the softmax functio n with application in game theory and reinforcement learning,” arXiv preprint arXiv:1704.00805 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Nonlinear systems,

H. K. Khalil, “Nonlinear systems,” Upper Saddle River , 2002. 14

work page 2002
[35]

V. S. Borkar, Stochastic approximation: a dynamical systems viewpoint . Springer, 2009, vol. 48

work page 2009
[36]

Finite-time analysis of asynchronous q-lea rning under diminishing step-size from control-theoretic view,

H.-D. Lim and D. Lee, “Finite-time analysis of asynchronous q-lea rning under diminishing step-size from control-theoretic view,” IEEE Access, 2024

work page 2024
[37]

Kushner and G

H. Kushner and G. G. Yin, Stochastic approximation and recursive algorithms and app lica- tions. Springer Science & Business Media, 2003, vol. 35

work page 2003
[38]

A stochastic approximation method,

H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951

work page 1951
[39]

Note on the derivatives with respect to a param eter of the solutions of a system of diﬀerential equations,

T. H. Gronwall, “Note on the derivatives with respect to a param eter of the solutions of a system of diﬀerential equations,” Annals of Mathematics , pp. 292–296, 1919. 15 Appendix A Basics of nonlinear system theory In this paper, we will frequently encounter several notions from n onlinear system theory for the ODE analysis. Let us consider the general ...

work page 1919
[41]

There exists a unique globally asymptotically stable equ ilibrium θe ∈ Rn for the ODE ˙xt = f (xt), i.e., xt → θe as t → ∞

work page
[42]

There exists a function f∞ : Rn → Rn such that limc→∞ f (cx) c =f∞ (x), ∀x ∈ Rn

work page
[43]

The origin in Rn is an asymptotically stable equilibrium for the ODE ˙xt =f∞ (xt)

work page
[44]

In addition, there exists a constant C0 < ∞ such that for any initial θ0 ∈ Rn, we have E[∥εk+1∥2 2|Gk] ≤ C0(1 + ∥θk∥2 2), ∀k ≥ 0

The sequence {εk, Gk,k ≥ 1} with Gk = σ (θi,ε i,i ≤ k) is a martingale diﬀerence sequence. In addition, there exists a constant C0 < ∞ such that for any initial θ0 ∈ Rn, we have E[∥εk+1∥2 2|Gk] ≤ C0(1 + ∥θk∥2 2), ∀k ≥ 0

work page
[45]

(17) Lemma 2 ( [5, Borkar and Meyn theorem])

The step-sizes satisfy α k > 0, ∞∑ k=0 α k = ∞ , ∞∑ k=0 α 2 k < ∞ . (17) Lemma 2 ( [5, Borkar and Meyn theorem]) . Under Assumption 3, for any initial θ0 ∈ Rn, we have supk≥ 0 ∥θk∥2< ∞ with probability one. In addition, θk → θe as k → ∞ with probability one. The above O.D.E approach Lemma 2 has been widely used to prove conv ergence of RLs, such as synchr...

work page
[46]

The mapping f : Rn → Rn is globally Lipschitz continuous

work page
[47]

e., xt → H as t → ∞

The ODE ˙xt =f (xt) has H as its set of globally asymptotically stable equilibria, i. e., xt → H as t → ∞

work page
[48]

The iterate deﬁned by (16) remain bounded with probability one, i.e., sup k≥ 0 ∥θk∥2< ∞ with probability one

work page
[49]

In addition, there exists a constant C0 < ∞ such that for any initial θ0 ∈ Rn, we have E[∥εk+1∥2 2|Gk] ≤ C0(1 + ∥θk∥2 2), ∀k ≥ 0 with probability one

The sequence {εk, Gk,k ≥ 1} with Gk = σ (θi,ε i,i ≤ k) is a martingale diﬀerence sequence. In addition, there exists a constant C0 < ∞ such that for any initial θ0 ∈ Rn, we have E[∥εk+1∥2 2|Gk] ≤ C0(1 + ∥θk∥2 2), ∀k ≥ 0 with probability one

work page
[50]

Lemma 3 ( [38, Robbins and Monro theorem])

The step-sizes satisfy α k > 0, ∑∞ k=0α k = ∞ , ∑∞ k=0α 2 k < ∞ . Lemma 3 ( [38, Robbins and Monro theorem]) . Under Assumption 3, for any initial θ0 ∈ Rn, supk≥ 0 ∥θk∥< ∞ with probability one. In addition, θk → θe as k → ∞ with probability one. 17 Robbins and Monro theorem [38] has been developed earlier than Bor kar and Meyn theorem [5]. The main diﬀere...

work page
[51]

w1/p min∥x∥∞ ≤ ∥ x∥p,w ≤ n1/p w1/p max∥x∥∞

work page
[52]

w1/p min∥x∥p ≤ ∥ x∥p,w ≤ w1/p max∥x∥p

work page
[53]

wmin∥x∥∞ ≤ ∥ x∥∞ ,w ≤ wmax∥x∥∞

work page
[54]

∥x∥∞ ≤ ∥ x∥p ≤ n1/p ∥x∥∞ wherewmin := min i∈{ 1, 2,...,n }wi and wmax := max i∈{ 1, 2,...,n }wi. Proof. In the ﬁrst statement, the lower bound can be obtained through ∥x∥p,w ≥ (wj |xj|p)1/p =w1/p j |xj| ≥ w1/p min|xj |, for any j ∈ { 1, 2,...,n }. By taking the maximum over j, we have ∥x∥p,w ≥ w1/p min∥x∥∞ . For the upper bound, one gets ∥x∥p,w = ( n∑ i=1...

work page
[55]

hmax(x) ≤ hλ lse(x) ≤ hmax(x) + ln(n) λ

work page
[56]

1 λ ln (1 n ) +hmax(x) ≤ hλ mm(x) ≤ hmax(x)

work page
[57]

hmax(x) − ln(n) λ ≤ hλ bz(x) ≤ hmax(x)

work page
[58]

hλ bz(x) ≤ hλ lse(x) ≤ hλ bz(x) + 1 λ ln(n) for any λ> 0. Proof. For convenience, let us deﬁne In := {1, 2,...,n }. Then, noting exp (maxi∈I nxi) ≤ ∑ i∈I n exi ≤ n exp (maxi∈I nxi), (18) we have ln (exp (maxi∈I nxi)) ≤ ln ( ∑ i∈I n e xi ) ≤ ln (n exp (maxi∈I nxi)). Replacing x with λx and dividing by λ, the ﬁrst statement follows. The second statement fol...

work page
[59]

8 0 . 0 0 . 1 0 . 1     , P 2 =    

work page
[60]

1 0 . 0 0 . 8 0 . 1     , whereP1 andP2 represent the state transition probability matrices under actions a = 1 and a = 2, respectively. The temperature parameter is set to λ = 0. 1. The simulation results for the continuous-time ODE dynamics are illust rated in Figure 3, where the ﬁrst three rows show the trajectories for the standard max , LSE, and ...

work page
[61]

Similar to the ODE case, the ﬁrst three rows (max, L SE, and mellowmax) show that the trajectories converge toward their ﬁxe d points as proved in Theorem 5

01 over 100 , 000 iterations. Similar to the ODE case, the ﬁrst three rows (max, L SE, and mellowmax) show that the trajectories converge toward their ﬁxe d points as proved in Theorem 5. We note that the trajectory plots are obtained by projecting the dynamics onto a two-dimensional plane for visualization, since the true Qk is six-dimensional. In the fo...

work page 2000

[1] [1]

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction . MIT Press, 1998

work page 1998

[2] [2]

Human-level control through deep rein- forcement learning,

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. B ellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al. , “Human-level control through deep rein- forcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015

work page 2015

[3] [3]

Q-learning,

C. J. Watkins and P. Dayan, “Q-learning,” Machine learning , vol. 8, no. 3-4, pp. 279–292, 1992. 12

work page 1992

[4] [4]

Convergence of st ochastic iterative dynamic programming algorithms,

T. Jaakkola, M. I. Jordan, and S. P. Singh, “Convergence of st ochastic iterative dynamic programming algorithms,” in Advances in neural information processing systems , 1994, pp. 703–710

work page 1994

[5] [5]

The ODE method for convergence of stochastic approximation and reinforcement learning,

V. S. Borkar and S. P. Meyn, “The ODE method for convergence of stochastic approximation and reinforcement learning,” SIAM Journal on Control and Optimization , vol. 38, no. 2, pp. 447–469, 2000

work page 2000

[6] [6]

A uniﬁed switching system perspective and con vergence analysis of Q- learning algorithms,

D. Lee and N. He, “A uniﬁed switching system perspective and con vergence analysis of Q- learning algorithms,” in 34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020

work page 2020

[7] [7]

Asynchronous stochastic approximation and Q- learning,

J. N. Tsitsiklis, “Asynchronous stochastic approximation and Q- learning,” Machine learning, vol. 16, no. 3, pp. 185–202, 1994

work page 1994

[8] [8]

Conv ergence results for single-step on-policy reinforcement-learning algorithms,

S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesv´ ari, “Conv ergence results for single-step on-policy reinforcement-learning algorithms,” Machine learning , vol. 38, pp. 287–308, 2000

work page 2000

[9] [9]

The asymptotic convergence-rate of Q-lear ning,

C. Szepesv´ ari, “The asymptotic convergence-rate of Q-lear ning,” in Advances in Neural In- formation Processing Systems , 1998, pp. 1064–1070

work page 1998

[10] [10]

Learning rates for Q-learning,

E. Even-Dar and Y. Mansour, “Learning rates for Q-learning,” Journal of machine learning Research, vol. 5, no. Dec, pp. 1–25, 2003

work page 2003

[11] [11]

Error bounds for constant step- size Q-learning,

C. L. Beck and R. Srikant, “Error bounds for constant step- size Q-learning,” Systems & Control letters , vol. 61, no. 12, pp. 1203–1208, 2012

work page 2012

[12] [12]

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

M. J. Wainwright, “Stochastic approximation with cone-contra ctive operators: Sharp ℓ∞ - bounds for Q-learning,” arXiv preprint arXiv:1905.06265 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[13] [13]

Finite-time analysis of asynchronous sto chastic approximation and Q-learning,

G. Qu and A. Wierman, “Finite-time analysis of asynchronous sto chastic approximation and Q-learning,” arXiv preprint arXiv:2002.00260 , 2020

work page arXiv 2002

[14] [14]

Sample complexity of asy nchronous Q-learning: Sharper analysis and variance reduction,

G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen, “Sample complexity of asy nchronous Q-learning: Sharper analysis and variance reduction,” arXiv preprint arXiv:2006.03041 , 2020

work page arXiv 2006

[15] [15]

A L yapunov theory for ﬁnite- sample guarantees of asynchronous Q-learning and TD-learning va riants,

Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam, “A L yapunov theory for ﬁnite- sample guarantees of asynchronous Q-learning and TD-learning va riants,” arXiv preprint arXiv:2102.01567, 2021

work page arXiv 2021

[16] [16]

Finite-sample analysis of contractive stochastic approxim ation using smooth convex envelopes,

——, “Finite-sample analysis of contractive stochastic approxim ation using smooth convex envelopes,” Advances in Neural Information Processing Systems , vol. 33, pp. 8223–8234, 2020

work page 2020

[17] [17]

Final iteration convergence of Q-learning: Switching s ystem approach,

D. Lee, “Final iteration convergence of Q-learning: Switching s ystem approach,” IEEE Trans- actions on Automatic Control , 2024

work page 2024

[18] [18]

A convergento(n) temporal-diﬀerence algorithm for oﬀ-policy learning with linear function approximation,

R. S. Sutton, H. R. Maei, and C. Szepesv´ ari, “A convergento(n) temporal-diﬀerence algorithm for oﬀ-policy learning with linear function approximation,” in Advances in neural information processing systems, 2009, pp. 1609–1616. 13

work page 2009

[19] [19]

Fast gradient-descent methods for temporal-diﬀerence learnin g with linear function approx- imation,

R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C . Szepesv´ ari, and E. Wiewiora, “Fast gradient-descent methods for temporal-diﬀerence learnin g with linear function approx- imation,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 993–1000

work page 2009

[20] [20]

Gradient temporal- diﬀerence learning with regularized corrections,

S. Ghiassian, A. Patterson, S. Garg, D. Gupta, A. White, and M . White, “Gradient temporal- diﬀerence learning with regularized corrections,” in International Conference on Machine Learning, 2020, pp. 3524–3534

work page 2020

[21] [21]

New versions of gradien t temporal diﬀerence learning,

D. Lee, H.-D. Lim, J. Park, and O. Choi, “New versions of gradien t temporal diﬀerence learning,” IEEE Transactions on Automatic Control , 2022

work page 2022

[22] [22]

An analysis of reinforc ement learning with function approximation,

F. S. Melo, S. P. Meyn, and M. I. Ribeiro, “An analysis of reinforc ement learning with function approximation,” in Proceedings of the 25th international conference on Machin e learning , 2008, pp. 664–671

work page 2008

[23] [23]

Bhatnagar, H

S. Bhatnagar, H. Prasad, and L. Prashanth, Stochastic recursive algorithms for optimization: simultaneous perturbation methods . Springer, 2012, vol. 434

work page 2012

[24] [24]

Reinforceme nt learning with deep energy- based policies,

T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforceme nt learning with deep energy- based policies,” in International conference on machine learning , 2017, pp. 1352–1361

work page 2017

[25] [25]

Revisiting the softmax Bellman o perator: New beneﬁts and new perspective,

Z. Song, R. Parr, and L. Carin, “Revisiting the softmax Bellman o perator: New beneﬁts and new perspective,” in International conference on machine learning , 2019, pp. 5916–5925

work page 2019

[26] [26]

Reinforcemen t learning with dynamic Boltzmann softmax updates,

L. Pan, Q. Cai, Q. Meng, W. Chen, and L. Huang, “Reinforcemen t learning with dynamic Boltzmann softmax updates,” in International Joint Conference on Artiﬁcial Intelligence , 2020, pp. 1992–1998

work page 2020

[27] [27]

An alternative softmax operator for reinforcement learning,

K. Asadi and M. L. Littman, “An alternative softmax operator for reinforcement learning,” in International Conference on Machine Learning , 2017, pp. 243–252

work page 2017

[28] [28]

Smoothed Q-learning,

D. Barber, “Smoothed Q-learning,” arXiv preprint arXiv:2303.08631 , 2023

work page arXiv 2023

[29] [29]

An analog scheme for ﬁxed point computation. i. theory,

V. S. Borkar and K. Soumyanatha, “An analog scheme for ﬁxed point computation. i. theory,” IEEE Transactions on Circuits and Systems I: Fundamental Th eory and Applications , vol. 44, no. 4, pp. 351–355, 1997

work page 1997

[30] [30]

Liberzon, Switching in systems and control

D. Liberzon, Switching in systems and control . Springer Science & Business Media, 2003

work page 2003

[31] [31]

Uniﬁed ﬁnite-time error analysis of soft q -learning,

N. Jeong and D. Lee, “Uniﬁed ﬁnite-time error analysis of soft q -learning,” Neurocomputing, vol. 626, p. 129582, 2025

work page 2025

[32] [32]

SBEED: Convergent reinforcement learning with nonlinear function approximation,

B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song , “SBEED: Convergent reinforcement learning with nonlinear function approximation,” in International conference on machine learning , 2018, pp. 1125–1134

work page 2018

[33] [33]

On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

B. Gao and L. Pavel, “On the properties of the softmax functio n with application in game theory and reinforcement learning,” arXiv preprint arXiv:1704.00805 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[34] [34]

Nonlinear systems,

H. K. Khalil, “Nonlinear systems,” Upper Saddle River , 2002. 14

work page 2002

[35] [35]

V. S. Borkar, Stochastic approximation: a dynamical systems viewpoint . Springer, 2009, vol. 48

work page 2009

[36] [36]

Finite-time analysis of asynchronous q-lea rning under diminishing step-size from control-theoretic view,

H.-D. Lim and D. Lee, “Finite-time analysis of asynchronous q-lea rning under diminishing step-size from control-theoretic view,” IEEE Access, 2024

work page 2024

[37] [37]

Kushner and G

H. Kushner and G. G. Yin, Stochastic approximation and recursive algorithms and app lica- tions. Springer Science & Business Media, 2003, vol. 35

work page 2003

[38] [38]

A stochastic approximation method,

H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951

work page 1951

[39] [39]

Note on the derivatives with respect to a param eter of the solutions of a system of diﬀerential equations,

T. H. Gronwall, “Note on the derivatives with respect to a param eter of the solutions of a system of diﬀerential equations,” Annals of Mathematics , pp. 292–296, 1919. 15 Appendix A Basics of nonlinear system theory In this paper, we will frequently encounter several notions from n onlinear system theory for the ODE analysis. Let us consider the general ...

work page 1919

[40] [41]

There exists a unique globally asymptotically stable equ ilibrium θe ∈ Rn for the ODE ˙xt = f (xt), i.e., xt → θe as t → ∞

work page

[41] [42]

There exists a function f∞ : Rn → Rn such that limc→∞ f (cx) c =f∞ (x), ∀x ∈ Rn

work page

[42] [43]

The origin in Rn is an asymptotically stable equilibrium for the ODE ˙xt =f∞ (xt)

work page

[43] [44]

In addition, there exists a constant C0 < ∞ such that for any initial θ0 ∈ Rn, we have E[∥εk+1∥2 2|Gk] ≤ C0(1 + ∥θk∥2 2), ∀k ≥ 0

The sequence {εk, Gk,k ≥ 1} with Gk = σ (θi,ε i,i ≤ k) is a martingale diﬀerence sequence. In addition, there exists a constant C0 < ∞ such that for any initial θ0 ∈ Rn, we have E[∥εk+1∥2 2|Gk] ≤ C0(1 + ∥θk∥2 2), ∀k ≥ 0

work page

[44] [45]

(17) Lemma 2 ( [5, Borkar and Meyn theorem])

The step-sizes satisfy α k > 0, ∞∑ k=0 α k = ∞ , ∞∑ k=0 α 2 k < ∞ . (17) Lemma 2 ( [5, Borkar and Meyn theorem]) . Under Assumption 3, for any initial θ0 ∈ Rn, we have supk≥ 0 ∥θk∥2< ∞ with probability one. In addition, θk → θe as k → ∞ with probability one. The above O.D.E approach Lemma 2 has been widely used to prove conv ergence of RLs, such as synchr...

work page

[45] [46]

The mapping f : Rn → Rn is globally Lipschitz continuous

work page

[46] [47]

e., xt → H as t → ∞

The ODE ˙xt =f (xt) has H as its set of globally asymptotically stable equilibria, i. e., xt → H as t → ∞

work page

[47] [48]

The iterate deﬁned by (16) remain bounded with probability one, i.e., sup k≥ 0 ∥θk∥2< ∞ with probability one

work page

[48] [49]

In addition, there exists a constant C0 < ∞ such that for any initial θ0 ∈ Rn, we have E[∥εk+1∥2 2|Gk] ≤ C0(1 + ∥θk∥2 2), ∀k ≥ 0 with probability one

The sequence {εk, Gk,k ≥ 1} with Gk = σ (θi,ε i,i ≤ k) is a martingale diﬀerence sequence. In addition, there exists a constant C0 < ∞ such that for any initial θ0 ∈ Rn, we have E[∥εk+1∥2 2|Gk] ≤ C0(1 + ∥θk∥2 2), ∀k ≥ 0 with probability one

work page

[49] [50]

Lemma 3 ( [38, Robbins and Monro theorem])

The step-sizes satisfy α k > 0, ∑∞ k=0α k = ∞ , ∑∞ k=0α 2 k < ∞ . Lemma 3 ( [38, Robbins and Monro theorem]) . Under Assumption 3, for any initial θ0 ∈ Rn, supk≥ 0 ∥θk∥< ∞ with probability one. In addition, θk → θe as k → ∞ with probability one. 17 Robbins and Monro theorem [38] has been developed earlier than Bor kar and Meyn theorem [5]. The main diﬀere...

work page

[50] [51]

w1/p min∥x∥∞ ≤ ∥ x∥p,w ≤ n1/p w1/p max∥x∥∞

work page

[51] [52]

w1/p min∥x∥p ≤ ∥ x∥p,w ≤ w1/p max∥x∥p

work page

[52] [53]

wmin∥x∥∞ ≤ ∥ x∥∞ ,w ≤ wmax∥x∥∞

work page

[53] [54]

∥x∥∞ ≤ ∥ x∥p ≤ n1/p ∥x∥∞ wherewmin := min i∈{ 1, 2,...,n }wi and wmax := max i∈{ 1, 2,...,n }wi. Proof. In the ﬁrst statement, the lower bound can be obtained through ∥x∥p,w ≥ (wj |xj|p)1/p =w1/p j |xj| ≥ w1/p min|xj |, for any j ∈ { 1, 2,...,n }. By taking the maximum over j, we have ∥x∥p,w ≥ w1/p min∥x∥∞ . For the upper bound, one gets ∥x∥p,w = ( n∑ i=1...

work page

[54] [55]

hmax(x) ≤ hλ lse(x) ≤ hmax(x) + ln(n) λ

work page

[55] [56]

1 λ ln (1 n ) +hmax(x) ≤ hλ mm(x) ≤ hmax(x)

work page

[56] [57]

hmax(x) − ln(n) λ ≤ hλ bz(x) ≤ hmax(x)

work page

[57] [58]

hλ bz(x) ≤ hλ lse(x) ≤ hλ bz(x) + 1 λ ln(n) for any λ> 0. Proof. For convenience, let us deﬁne In := {1, 2,...,n }. Then, noting exp (maxi∈I nxi) ≤ ∑ i∈I n exi ≤ n exp (maxi∈I nxi), (18) we have ln (exp (maxi∈I nxi)) ≤ ln ( ∑ i∈I n e xi ) ≤ ln (n exp (maxi∈I nxi)). Replacing x with λx and dividing by λ, the ﬁrst statement follows. The second statement fol...

work page

[58] [59]

8 0 . 0 0 . 1 0 . 1     , P 2 =    

work page

[59] [60]

1 0 . 0 0 . 8 0 . 1     , whereP1 andP2 represent the state transition probability matrices under actions a = 1 and a = 2, respectively. The temperature parameter is set to λ = 0. 1. The simulation results for the continuous-time ODE dynamics are illust rated in Figure 3, where the ﬁrst three rows show the trajectories for the standard max , LSE, and ...

work page

[60] [61]

Similar to the ODE case, the ﬁrst three rows (max, L SE, and mellowmax) show that the trajectories converge toward their ﬁxe d points as proved in Theorem 5

01 over 100 , 000 iterations. Similar to the ODE case, the ﬁrst three rows (max, L SE, and mellowmax) show that the trajectories converge toward their ﬁxe d points as proved in Theorem 5. We note that the trajectory plots are obtained by projecting the dynamics onto a two-dimensional plane for visualization, since the true Qk is six-dimensional. In the fo...

work page 2000