Sign-Separated Finite-Time Error Analysis of Q-Learning

Donghwan Lee

arxiv: 2605.16103 · v1 · pith:GEZJ44UMnew · submitted 2026-05-15 · 💻 cs.AI

Sign-Separated Finite-Time Error Analysis of Q-Learning

Donghwan Lee This is my paper

Pith reviewed 2026-05-20 17:58 UTC · model grok-4.3

classification 💻 cs.AI

keywords Q-learningfinite-time analysiserror boundsconstant step sizeswitching systemsoverestimation biassign separationreinforcement learning

0 comments

The pith

Q-learning errors split by sign reveal that negative components are bounded by a fixed optimal-policy linear system no slower than the positive switching bound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. It starts from the switching-system representation of the algorithm and decomposes the error into its componentwise negative and positive parts. The negative part is shown to be dominated by a lower comparison linear time-invariant system tied to a fixed optimal policy, while the positive part is controlled by a linear switching system. This decomposition identifies a max-induced asymmetry that connects directly to the known overestimation tendency in Q-learning, because the Bellman maximum selects and propagates positive errors but allows negative errors to be compared against the optimal policy. The resulting bounds are given for both deterministic and stochastic recursions and demonstrate that the negative-side certificate is at least as fast as the positive-side one and can be strictly faster.

Core claim

By decomposing the Q-learning error into negative and positive components from the switching-system representation, the negative component is dominated by a lower comparison LTI system associated with a fixed optimal policy whereas the positive component is controlled by a linear switching system; the resulting bounds establish that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope, revealing a max-induced asymmetry in the error dynamics that explains overestimation.

What carries the argument

The sign-separated error decomposition that isolates negative errors under a fixed optimal-policy LTI lower comparison while positive errors remain under switching dynamics.

If this is right

Finite-time error bounds hold for the deterministic constant-step-size Q-learning recursion.
Finite-time error bounds also hold for the stochastic constant-step-size Q-learning recursion.
The identified asymmetry directly accounts for the overestimation bias because positive action-wise errors are selected by the Bellman maximum.
The negative-side LTI bound can supply a strictly faster exponential envelope than the positive-side switching bound in some cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sign-separation idea could be applied to analyze finite-time behavior in other value-based methods that rely on the max operator.
Step-size schedules might be designed to exploit the faster negative-side envelope while controlling the slower positive-side dynamics.
The asymmetry suggests that bias-correction techniques could focus specifically on damping the propagation of positive errors.
Hybrid analysis combining LTI lower bounds with switching upper bounds may extend to related stochastic approximation algorithms beyond Q-learning.

Load-bearing premise

The error must be decomposable into componentwise negative and positive parts such that the negative part is dominated by a lower comparison linear time-invariant system associated with a fixed optimal policy.

What would settle it

A counterexample recursion or numerical simulation in which the observed convergence rate of the negative error component is slower than the exponential envelope given by the positive-side switching bound would falsify the central comparison claim.

Figures

Figures reproduced from arXiv: 2605.16103 by Donghwan Lee.

read the original abstract

This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation, the error is decomposed into its componentwise negative and positive parts. The negative part is dominated by a lower comparison linear time-invariant (LTI) system associated with a fixed optimal policy, whereas the positive part is controlled by a linear switching system. The resulting bounds show that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope. The analysis identifies a max-induced asymmetry in Q-learning error dynamics. This asymmetry is connected to overestimation: positive action-wise errors can be selected and propagated by the Bellman maximum, whereas negative errors admit an optimal-policy lower comparison. Finite-time bounds are provided for both deterministic and stochastic constant-step-size recursions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The sign-separated decomposition gives a clean way to bound constant-step Q-learning errors by splitting positive and negative parts, but the negative LTI domination needs explicit checks against mode switches.

read the letter

The main thing here is a sign-separated finite-time error analysis for constant step-size Q-learning. Errors get split into componentwise positive and negative parts from the switching-system starting point. Negative errors are dominated by a lower-comparison LTI system tied to the fixed optimal policy, while positive errors run through a switching system. The bounds show the negative side is no slower than the positive side and can sometimes give a faster exponential envelope. This also ties the asymmetry to overestimation, since the max operator can pick up and spread positive errors but negative ones allow the optimal-policy comparison. Bounds are given for both deterministic and stochastic cases.

Referee Report

2 major / 2 minor

Summary. The paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation of the Q-learning recursion, the error is decomposed componentwise into positive and negative parts. The negative error is shown to be dominated by a lower-comparison LTI system driven by the fixed optimal policy, while the positive error is bounded via a linear switching system. The resulting bounds establish that the negative-side LTI certificate is no slower than the positive-side switching certificate and can yield a faster exponential envelope. The analysis highlights a max-induced asymmetry in the error dynamics, linking it to overestimation bias, and supplies explicit finite-time bounds for both deterministic and stochastic constant-step-size cases.

Significance. If the central domination arguments hold, the work offers a conceptually useful refinement of finite-time Q-learning analysis by exploiting sign separation to expose an asymmetry not visible in standard uniform bounds. The connection to overestimation and the explicit comparison between LTI and switching certificates could inform tighter rate analyses or algorithm design in value-based RL. The provision of both deterministic and stochastic bounds strengthens the practical relevance of the theoretical contribution.

major comments (2)

[switching-system representation and negative-error domination argument] The central claim that the negative error component is dominated by the LTI system associated with the fixed optimal policy (as described in the abstract and the starting point from the switching-system representation) requires explicit verification that the domination inequality is preserved under mode switches. When positive errors temporarily make a suboptimal action maximal, the effective transition matrix seen by the negative components changes; the manuscript must show that the comparison still holds in this case, e.g., via a specific lemma or theorem establishing invariance of the inequality under the argmax selection.
[stochastic recursion bounds] The finite-time bounds for the stochastic case rely on the same sign-separated decomposition. It is unclear whether the concentration or martingale arguments used to control the noise term interact with the mode-switching induced by positive errors; a concrete statement of how the stochastic perturbation is handled while preserving the LTI lower comparison would strengthen the result.

minor comments (2)

[error decomposition] Notation for the positive and negative error vectors (e.g., e^+ and e^-) should be introduced with an explicit definition immediately after the decomposition step to avoid ambiguity in later sections.
[abstract and main theorem statement] The abstract states that the negative-side LTI certificate 'may produce a faster exponential envelope'; a brief remark on the conditions under which strict improvement occurs would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The sign-separated decomposition is intended to expose an asymmetry in Q-learning error dynamics that is not captured by uniform bounds. We address each major comment below and will revise the manuscript to provide the requested clarifications and explicit statements.

read point-by-point responses

Referee: The central claim that the negative error component is dominated by the LTI system associated with the fixed optimal policy (as described in the abstract and the starting point from the switching-system representation) requires explicit verification that the domination inequality is preserved under mode switches. When positive errors temporarily make a suboptimal action maximal, the effective transition matrix seen by the negative components changes; the manuscript must show that the comparison still holds in this case, e.g., via a specific lemma or theorem establishing invariance of the inequality under the argmax selection.

Authors: We agree that an explicit invariance statement would strengthen the presentation. The current proof proceeds by induction, showing that the negative-error vector remains componentwise above the trajectory of the optimal-policy LTI system for any sequence of switches; the key step uses the fact that any deviation from the optimal policy can only increase the selected Q-value and therefore cannot violate the lower comparison for the negative components. Nevertheless, we will add a dedicated lemma (new Lemma 3.4) that isolates this invariance property under arbitrary argmax-induced mode sequences and proves it directly from the componentwise ordering. The revised manuscript will place this lemma immediately after the switching-system representation and before the main domination theorem. revision: yes
Referee: The finite-time bounds for the stochastic case rely on the same sign-separated decomposition. It is unclear whether the concentration or martingale arguments used to control the noise term interact with the mode-switching induced by positive errors; a concrete statement of how the stochastic perturbation is handled while preserving the LTI lower comparison would strengthen the result.

Authors: We thank the referee for noting this point. In the stochastic analysis the noise is treated as an additive martingale difference sequence that is independent of the sign separation at each step. The LTI lower comparison is applied to the deterministic drift, while the stochastic deviation is bounded uniformly via a vector martingale inequality that holds regardless of the (finite) number of possible switching modes. To make the argument transparent we will insert a short proposition (new Proposition 4.3) that states the high-probability bound on the cumulative noise while explicitly preserving the componentwise domination by the optimal-policy LTI system. A brief remark will also be added explaining that the worst-case switching sequence is already accounted for by taking the supremum inside the concentration bound. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation starts from external switching-system representation

full rationale

The paper begins its analysis from the switching-system representation of Q-learning as a given starting point and then decomposes the error into componentwise negative and positive parts. The negative component is dominated by a lower-comparison LTI system associated with the fixed optimal policy, while the positive component is bounded via the switching dynamics. The claimed comparison of exponential envelopes between the negative-side LTI certificate and the positive-side switching certificate is obtained by direct analysis of these two systems' properties. No step equates a derived bound to its input by construction, renames a fitted quantity as a prediction, or relies on a load-bearing self-citation whose validity is assumed rather than independently verified. The central asymmetry result follows from the structure of the Bellman max operator applied to the decomposed errors and remains self-contained against the external representation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central approach rests on the switching-system representation of Q-learning dynamics as the starting point for decomposition; no free parameters, new entities, or additional axioms are indicated in the abstract.

axioms (1)

domain assumption Switching-system representation of the Q-learning error dynamics
Explicitly stated as the starting point for the sign-separated decomposition in the abstract.

pith-pipeline@v0.9.0 · 5657 in / 1151 out tokens · 62219 ms · 2026-05-20T17:58:17.102105+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the negative part is dominated by a lower comparison linear time-invariant (LTI) system associated with a fixed optimal policy, whereas the positive part is controlled by a linear switching system
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ek = e+_k − e−_k ... e−_{k+1} ≤ A⋆_− e−_k

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

[1]

Error bounds for constant step-size Q-learning

Carolyn L Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & control letters, 61(12):1203–1208, 2012

work page 2012
[2]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis.Neuro-dynamic programming. Athena Scientific Belmont, MA, 1996

work page 1996
[3]

Computationally efficient approximations of the joint spectral radius.SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005

Vincent D Blondel and Yurii Nesterov. Computationally efficient approximations of the joint spectral radius.SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005. 18

work page 2005
[4]

The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

work page 2000
[5]

A L yapunov theory for ﬁnite- sample guarantees of asynchronous Q-learning and TD-learning va riants,

Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants.arXiv preprint arXiv:2102.01567, 2021

work page arXiv 2021
[6]

Learning rates for Q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

work page 2003
[7]

Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

work page 1993
[8]

Springer Science & Business Media, 2009

Raphaël Jungers.The joint spectral radius: Theory and applications, volume 385. Springer Science & Business Media, 2009

work page 2009
[9]

Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

work page 1998
[10]

George Yin.Stochastic approximation and recursive algorithms and applications, volume 35

Harold Kushner and G. George Yin.Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003

work page 2003
[11]

Final iteration convergence bound of Q-learning: Switching system approach

Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach. IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

work page 2024
[12]

Lyapunov-Certified Direct Switching Theory for Q-Learning

Donghwan Lee. Lyapunov-certified direct switching theory for Q-learning.arXiv preprint arXiv:2604.19569, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

A unified switching system perspective and convergence analysis of Q-learning algorithms

Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020

work page 2020
[14]

A discrete-time switching system analysis of Q-learning.SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023

Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning.SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023

work page 2023
[15]

Sample complexity of asyn- chronous Q-learning: Sharper analysis and variance reduction.IEEE Transactions on Informa- tion Theory, 68(1):448–473, 2021

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asyn- chronous Q-learning: Sharper analysis and variance reduction.IEEE Transactions on Informa- tion Theory, 68(1):448–473, 2021

work page 2021
[16]

Springer Science & Business Media, 2003

Daniel Liberzon.Switching in systems and control. Springer Science & Business Media, 2003

work page 2003
[17]

Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view.IEEE Access, 12:149916–149939, 2024

Han-Dong Lim and Donghwan Lee. Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view.IEEE Access, 12:149916–149939, 2024

work page 2024
[18]

Stability and stabilizability of switched linear systems: A survey of recent results.IEEE Transactions on Automatic control, 54(2):308–322, 2009

Hai Lin and Panos J Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results.IEEE Transactions on Automatic control, 54(2):308–322, 2009

work page 2009
[19]

Puterman.Markov decision processes: Discrete stochastic dynamic programming

Martin L. Puterman.Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014
[20]

Finite-time analysis of asynchronous stochastic approximation and Q-learning

Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on learning theory, pages 3185–3205, 2020. 19

work page 2020
[21]

A note on the joint spectral radius.Indag

Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius.Indag. Math, 22(4): 379–381, 1960

work page 1960
[22]

Stability criteria for switched and hybrid systems.SIAM Review, 49(4):545–592, 2007

Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulff, and Christopher King. Stability criteria for switched and hybrid systems.SIAM Review, 49(4):545–592, 2007

work page 2007
[23]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement learning: An introduction. MIT Press, 1998

work page 1998
[24]

The asymptotic convergence-rate of Q-learning

Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. InAdvances in Neural Information Processing Systems, pages 1064–1070, 1998

work page 1998
[25]

Issues in using function approximation for reinforcement learning

Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Michael Mozer, Paul Smolensky, David Touretzky, Jeffrey Elman, and Andreas Weigend, editors,Proceedings of the 1993 Connectionist Models Summer School, pages 255–263. Lawrence Erlbaum, 1993

work page 1993
[26]

Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3):185–202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3):185–202, 1994

work page 1994
[27]

John N Tsitsiklis and Vincent D Blondel. The lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate.Mathematics of Control, Signals and Systems, 10(1):31–40, 1997

work page 1997
[28]

Double q-learning

Hado van Hasselt. Double q-learning. InAdvances in Neural Information Processing Systems, volume 23, pages 2613–2622. Curran Associates, Inc., 2010

work page 2010
[29]

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpl∞- bounds for Q-learning.arXiv preprint arXiv:1905.06265, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[30]

Q-learning.Machine learning, 8(3):279–292, 1992

Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3):279–292, 1992. 20 Appendix A Auxiliary Lyapunov constructions The following lemma records the product-defined Lyapunov construction used in the finite-time arguments. Lemma 7.Let H:={A 1, A2, . . . , AM } ⊂R m×m, ρ:=ρ(H), and fix anyε >0such that βε :=ρ+ε∈(0,1). For a sequence of s...

work page 1992
[31]

Statement 2 follows from homogeneity of the Euclidean norm and from the fact thatvt+1 ε is obtained fromvt ε by adding one nonnegative term

This proves Statement 3. Statement 2 follows from homogeneity of the Euclidean norm and from the fact thatvt+1 ε is obtained fromvt ε by adding one nonnegative term. For Statement 1, fixi∈ { 1, . . . , M}. For everyk≥ 0, each productAσAi, with σ∈ { 1, . . . , M}k, is a product of lengthk+ 1fromH. Therefore β−2 ε vt ε(Aix) = tX k=0 β−2(k+1) ε max σ∈{1,...,...

work page
[32]

Taking conditional expectations and applying Lemma 8 proves the claim. C.2 Deterministic comparison lemmas For anyQande=Q−Q ∗, the Bellman-max term can be expanded state by state as VQ(s)−V ∗(s) = max a∈A {Q∗(s, a) +e(s, a)} −V ∗(s) = max a∈A {e(s, a)−A ∗(s, a)}. Ifπ∈Θ ∗, thenA ∗(s, π(s)) = 0, and therefore VQ −V ∗ −Π πe≥0. For the upper comparison, ifπ+(...

work page

[1] [1]

Error bounds for constant step-size Q-learning

Carolyn L Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & control letters, 61(12):1203–1208, 2012

work page 2012

[2] [2]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis.Neuro-dynamic programming. Athena Scientific Belmont, MA, 1996

work page 1996

[3] [3]

Computationally efficient approximations of the joint spectral radius.SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005

Vincent D Blondel and Yurii Nesterov. Computationally efficient approximations of the joint spectral radius.SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005. 18

work page 2005

[4] [4]

The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

work page 2000

[5] [5]

A L yapunov theory for ﬁnite- sample guarantees of asynchronous Q-learning and TD-learning va riants,

Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants.arXiv preprint arXiv:2102.01567, 2021

work page arXiv 2021

[6] [6]

Learning rates for Q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

work page 2003

[7] [7]

Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

work page 1993

[8] [8]

Springer Science & Business Media, 2009

Raphaël Jungers.The joint spectral radius: Theory and applications, volume 385. Springer Science & Business Media, 2009

work page 2009

[9] [9]

Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

work page 1998

[10] [10]

George Yin.Stochastic approximation and recursive algorithms and applications, volume 35

Harold Kushner and G. George Yin.Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003

work page 2003

[11] [11]

Final iteration convergence bound of Q-learning: Switching system approach

Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach. IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

work page 2024

[12] [12]

Lyapunov-Certified Direct Switching Theory for Q-Learning

Donghwan Lee. Lyapunov-certified direct switching theory for Q-learning.arXiv preprint arXiv:2604.19569, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[13] [13]

A unified switching system perspective and convergence analysis of Q-learning algorithms

Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020

work page 2020

[14] [14]

A discrete-time switching system analysis of Q-learning.SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023

Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning.SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023

work page 2023

[15] [15]

Sample complexity of asyn- chronous Q-learning: Sharper analysis and variance reduction.IEEE Transactions on Informa- tion Theory, 68(1):448–473, 2021

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asyn- chronous Q-learning: Sharper analysis and variance reduction.IEEE Transactions on Informa- tion Theory, 68(1):448–473, 2021

work page 2021

[16] [16]

Springer Science & Business Media, 2003

Daniel Liberzon.Switching in systems and control. Springer Science & Business Media, 2003

work page 2003

[17] [17]

Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view.IEEE Access, 12:149916–149939, 2024

Han-Dong Lim and Donghwan Lee. Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view.IEEE Access, 12:149916–149939, 2024

work page 2024

[18] [18]

Stability and stabilizability of switched linear systems: A survey of recent results.IEEE Transactions on Automatic control, 54(2):308–322, 2009

Hai Lin and Panos J Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results.IEEE Transactions on Automatic control, 54(2):308–322, 2009

work page 2009

[19] [19]

Puterman.Markov decision processes: Discrete stochastic dynamic programming

Martin L. Puterman.Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014

[20] [20]

Finite-time analysis of asynchronous stochastic approximation and Q-learning

Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on learning theory, pages 3185–3205, 2020. 19

work page 2020

[21] [21]

A note on the joint spectral radius.Indag

Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius.Indag. Math, 22(4): 379–381, 1960

work page 1960

[22] [22]

Stability criteria for switched and hybrid systems.SIAM Review, 49(4):545–592, 2007

Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulff, and Christopher King. Stability criteria for switched and hybrid systems.SIAM Review, 49(4):545–592, 2007

work page 2007

[23] [23]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement learning: An introduction. MIT Press, 1998

work page 1998

[24] [24]

The asymptotic convergence-rate of Q-learning

Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. InAdvances in Neural Information Processing Systems, pages 1064–1070, 1998

work page 1998

[25] [25]

Issues in using function approximation for reinforcement learning

Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Michael Mozer, Paul Smolensky, David Touretzky, Jeffrey Elman, and Andreas Weigend, editors,Proceedings of the 1993 Connectionist Models Summer School, pages 255–263. Lawrence Erlbaum, 1993

work page 1993

[26] [26]

Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3):185–202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3):185–202, 1994

work page 1994

[27] [27]

John N Tsitsiklis and Vincent D Blondel. The lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate.Mathematics of Control, Signals and Systems, 10(1):31–40, 1997

work page 1997

[28] [28]

Double q-learning

Hado van Hasselt. Double q-learning. InAdvances in Neural Information Processing Systems, volume 23, pages 2613–2622. Curran Associates, Inc., 2010

work page 2010

[29] [29]

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpl∞- bounds for Q-learning.arXiv preprint arXiv:1905.06265, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[30] [30]

Q-learning.Machine learning, 8(3):279–292, 1992

Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3):279–292, 1992. 20 Appendix A Auxiliary Lyapunov constructions The following lemma records the product-defined Lyapunov construction used in the finite-time arguments. Lemma 7.Let H:={A 1, A2, . . . , AM } ⊂R m×m, ρ:=ρ(H), and fix anyε >0such that βε :=ρ+ε∈(0,1). For a sequence of s...

work page 1992

[31] [31]

Statement 2 follows from homogeneity of the Euclidean norm and from the fact thatvt+1 ε is obtained fromvt ε by adding one nonnegative term

This proves Statement 3. Statement 2 follows from homogeneity of the Euclidean norm and from the fact thatvt+1 ε is obtained fromvt ε by adding one nonnegative term. For Statement 1, fixi∈ { 1, . . . , M}. For everyk≥ 0, each productAσAi, with σ∈ { 1, . . . , M}k, is a product of lengthk+ 1fromH. Therefore β−2 ε vt ε(Aix) = tX k=0 β−2(k+1) ε max σ∈{1,...,...

work page

[32] [32]

Taking conditional expectations and applying Lemma 8 proves the claim. C.2 Deterministic comparison lemmas For anyQande=Q−Q ∗, the Bellman-max term can be expanded state by state as VQ(s)−V ∗(s) = max a∈A {Q∗(s, a) +e(s, a)} −V ∗(s) = max a∈A {e(s, a)−A ∗(s, a)}. Ifπ∈Θ ∗, thenA ∗(s, π(s)) = 0, and therefore VQ −V ∗ −Π πe≥0. For the upper comparison, ifπ+(...

work page