pith. sign in

arxiv: 2605.16103 · v1 · pith:GEZJ44UMnew · submitted 2026-05-15 · 💻 cs.AI

Sign-Separated Finite-Time Error Analysis of Q-Learning

Pith reviewed 2026-05-20 17:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords Q-learningfinite-time analysiserror boundsconstant step sizeswitching systemsoverestimation biassign separationreinforcement learning
0
0 comments X

The pith

Q-learning errors split by sign reveal that negative components are bounded by a fixed optimal-policy linear system no slower than the positive switching bound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. It starts from the switching-system representation of the algorithm and decomposes the error into its componentwise negative and positive parts. The negative part is shown to be dominated by a lower comparison linear time-invariant system tied to a fixed optimal policy, while the positive part is controlled by a linear switching system. This decomposition identifies a max-induced asymmetry that connects directly to the known overestimation tendency in Q-learning, because the Bellman maximum selects and propagates positive errors but allows negative errors to be compared against the optimal policy. The resulting bounds are given for both deterministic and stochastic recursions and demonstrate that the negative-side certificate is at least as fast as the positive-side one and can be strictly faster.

Core claim

By decomposing the Q-learning error into negative and positive components from the switching-system representation, the negative component is dominated by a lower comparison LTI system associated with a fixed optimal policy whereas the positive component is controlled by a linear switching system; the resulting bounds establish that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope, revealing a max-induced asymmetry in the error dynamics that explains overestimation.

What carries the argument

The sign-separated error decomposition that isolates negative errors under a fixed optimal-policy LTI lower comparison while positive errors remain under switching dynamics.

If this is right

  • Finite-time error bounds hold for the deterministic constant-step-size Q-learning recursion.
  • Finite-time error bounds also hold for the stochastic constant-step-size Q-learning recursion.
  • The identified asymmetry directly accounts for the overestimation bias because positive action-wise errors are selected by the Bellman maximum.
  • The negative-side LTI bound can supply a strictly faster exponential envelope than the positive-side switching bound in some cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sign-separation idea could be applied to analyze finite-time behavior in other value-based methods that rely on the max operator.
  • Step-size schedules might be designed to exploit the faster negative-side envelope while controlling the slower positive-side dynamics.
  • The asymmetry suggests that bias-correction techniques could focus specifically on damping the propagation of positive errors.
  • Hybrid analysis combining LTI lower bounds with switching upper bounds may extend to related stochastic approximation algorithms beyond Q-learning.

Load-bearing premise

The error must be decomposable into componentwise negative and positive parts such that the negative part is dominated by a lower comparison linear time-invariant system associated with a fixed optimal policy.

What would settle it

A counterexample recursion or numerical simulation in which the observed convergence rate of the negative error component is slower than the exponential envelope given by the positive-side switching bound would falsify the central comparison claim.

Figures

Figures reproduced from arXiv: 2605.16103 by Donghwan Lee.

Figure 1
Figure 1. Figure 1: Schematic sign-separated envelopes. The positive component is certified by the full [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
read the original abstract

This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation, the error is decomposed into its componentwise negative and positive parts. The negative part is dominated by a lower comparison linear time-invariant (LTI) system associated with a fixed optimal policy, whereas the positive part is controlled by a linear switching system. The resulting bounds show that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope. The analysis identifies a max-induced asymmetry in Q-learning error dynamics. This asymmetry is connected to overestimation: positive action-wise errors can be selected and propagated by the Bellman maximum, whereas negative errors admit an optimal-policy lower comparison. Finite-time bounds are provided for both deterministic and stochastic constant-step-size recursions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation of the Q-learning recursion, the error is decomposed componentwise into positive and negative parts. The negative error is shown to be dominated by a lower-comparison LTI system driven by the fixed optimal policy, while the positive error is bounded via a linear switching system. The resulting bounds establish that the negative-side LTI certificate is no slower than the positive-side switching certificate and can yield a faster exponential envelope. The analysis highlights a max-induced asymmetry in the error dynamics, linking it to overestimation bias, and supplies explicit finite-time bounds for both deterministic and stochastic constant-step-size cases.

Significance. If the central domination arguments hold, the work offers a conceptually useful refinement of finite-time Q-learning analysis by exploiting sign separation to expose an asymmetry not visible in standard uniform bounds. The connection to overestimation and the explicit comparison between LTI and switching certificates could inform tighter rate analyses or algorithm design in value-based RL. The provision of both deterministic and stochastic bounds strengthens the practical relevance of the theoretical contribution.

major comments (2)
  1. [switching-system representation and negative-error domination argument] The central claim that the negative error component is dominated by the LTI system associated with the fixed optimal policy (as described in the abstract and the starting point from the switching-system representation) requires explicit verification that the domination inequality is preserved under mode switches. When positive errors temporarily make a suboptimal action maximal, the effective transition matrix seen by the negative components changes; the manuscript must show that the comparison still holds in this case, e.g., via a specific lemma or theorem establishing invariance of the inequality under the argmax selection.
  2. [stochastic recursion bounds] The finite-time bounds for the stochastic case rely on the same sign-separated decomposition. It is unclear whether the concentration or martingale arguments used to control the noise term interact with the mode-switching induced by positive errors; a concrete statement of how the stochastic perturbation is handled while preserving the LTI lower comparison would strengthen the result.
minor comments (2)
  1. [error decomposition] Notation for the positive and negative error vectors (e.g., e^+ and e^-) should be introduced with an explicit definition immediately after the decomposition step to avoid ambiguity in later sections.
  2. [abstract and main theorem statement] The abstract states that the negative-side LTI certificate 'may produce a faster exponential envelope'; a brief remark on the conditions under which strict improvement occurs would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The sign-separated decomposition is intended to expose an asymmetry in Q-learning error dynamics that is not captured by uniform bounds. We address each major comment below and will revise the manuscript to provide the requested clarifications and explicit statements.

read point-by-point responses
  1. Referee: The central claim that the negative error component is dominated by the LTI system associated with the fixed optimal policy (as described in the abstract and the starting point from the switching-system representation) requires explicit verification that the domination inequality is preserved under mode switches. When positive errors temporarily make a suboptimal action maximal, the effective transition matrix seen by the negative components changes; the manuscript must show that the comparison still holds in this case, e.g., via a specific lemma or theorem establishing invariance of the inequality under the argmax selection.

    Authors: We agree that an explicit invariance statement would strengthen the presentation. The current proof proceeds by induction, showing that the negative-error vector remains componentwise above the trajectory of the optimal-policy LTI system for any sequence of switches; the key step uses the fact that any deviation from the optimal policy can only increase the selected Q-value and therefore cannot violate the lower comparison for the negative components. Nevertheless, we will add a dedicated lemma (new Lemma 3.4) that isolates this invariance property under arbitrary argmax-induced mode sequences and proves it directly from the componentwise ordering. The revised manuscript will place this lemma immediately after the switching-system representation and before the main domination theorem. revision: yes

  2. Referee: The finite-time bounds for the stochastic case rely on the same sign-separated decomposition. It is unclear whether the concentration or martingale arguments used to control the noise term interact with the mode-switching induced by positive errors; a concrete statement of how the stochastic perturbation is handled while preserving the LTI lower comparison would strengthen the result.

    Authors: We thank the referee for noting this point. In the stochastic analysis the noise is treated as an additive martingale difference sequence that is independent of the sign separation at each step. The LTI lower comparison is applied to the deterministic drift, while the stochastic deviation is bounded uniformly via a vector martingale inequality that holds regardless of the (finite) number of possible switching modes. To make the argument transparent we will insert a short proposition (new Proposition 4.3) that states the high-probability bound on the cumulative noise while explicitly preserving the componentwise domination by the optimal-policy LTI system. A brief remark will also be added explaining that the worst-case switching sequence is already accounted for by taking the supremum inside the concentration bound. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation starts from external switching-system representation

full rationale

The paper begins its analysis from the switching-system representation of Q-learning as a given starting point and then decomposes the error into componentwise negative and positive parts. The negative component is dominated by a lower-comparison LTI system associated with the fixed optimal policy, while the positive component is bounded via the switching dynamics. The claimed comparison of exponential envelopes between the negative-side LTI certificate and the positive-side switching certificate is obtained by direct analysis of these two systems' properties. No step equates a derived bound to its input by construction, renames a fitted quantity as a prediction, or relies on a load-bearing self-citation whose validity is assumed rather than independently verified. The central asymmetry result follows from the structure of the Bellman max operator applied to the decomposed errors and remains self-contained against the external representation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central approach rests on the switching-system representation of Q-learning dynamics as the starting point for decomposition; no free parameters, new entities, or additional axioms are indicated in the abstract.

axioms (1)
  • domain assumption Switching-system representation of the Q-learning error dynamics
    Explicitly stated as the starting point for the sign-separated decomposition in the abstract.

pith-pipeline@v0.9.0 · 5657 in / 1151 out tokens · 62219 ms · 2026-05-20T17:58:17.102105+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

  1. [1]

    Error bounds for constant step-size Q-learning

    Carolyn L Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & control letters, 61(12):1203–1208, 2012

  2. [2]

    Bertsekas and John N

    Dimitri P. Bertsekas and John N. Tsitsiklis.Neuro-dynamic programming. Athena Scientific Belmont, MA, 1996

  3. [3]

    Computationally efficient approximations of the joint spectral radius.SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005

    Vincent D Blondel and Yurii Nesterov. Computationally efficient approximations of the joint spectral radius.SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005. 18

  4. [4]

    The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

    Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

  5. [5]

    A L yapunov theory for finite- sample guarantees of asynchronous Q-learning and TD-learning va riants,

    Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants.arXiv preprint arXiv:2102.01567, 2021

  6. [6]

    Learning rates for Q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

    Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning.Journal of machine learning Research, 5(Dec):1–25, 2003

  7. [7]

    Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

    Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

  8. [8]

    Springer Science & Business Media, 2009

    Raphaël Jungers.The joint spectral radius: Theory and applications, volume 385. Springer Science & Business Media, 2009

  9. [9]

    Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

    Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms.Advances in neural information processing systems, 11, 1998

  10. [10]

    George Yin.Stochastic approximation and recursive algorithms and applications, volume 35

    Harold Kushner and G. George Yin.Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003

  11. [11]

    Final iteration convergence bound of Q-learning: Switching system approach

    Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach. IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

  12. [12]

    Lyapunov-Certified Direct Switching Theory for Q-Learning

    Donghwan Lee. Lyapunov-certified direct switching theory for Q-learning.arXiv preprint arXiv:2604.19569, 2026

  13. [13]

    A unified switching system perspective and convergence analysis of Q-learning algorithms

    Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020

  14. [14]

    A discrete-time switching system analysis of Q-learning.SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023

    Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning.SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023

  15. [15]

    Sample complexity of asyn- chronous Q-learning: Sharper analysis and variance reduction.IEEE Transactions on Informa- tion Theory, 68(1):448–473, 2021

    Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asyn- chronous Q-learning: Sharper analysis and variance reduction.IEEE Transactions on Informa- tion Theory, 68(1):448–473, 2021

  16. [16]

    Springer Science & Business Media, 2003

    Daniel Liberzon.Switching in systems and control. Springer Science & Business Media, 2003

  17. [17]

    Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view.IEEE Access, 12:149916–149939, 2024

    Han-Dong Lim and Donghwan Lee. Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view.IEEE Access, 12:149916–149939, 2024

  18. [18]

    Stability and stabilizability of switched linear systems: A survey of recent results.IEEE Transactions on Automatic control, 54(2):308–322, 2009

    Hai Lin and Panos J Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results.IEEE Transactions on Automatic control, 54(2):308–322, 2009

  19. [19]

    Puterman.Markov decision processes: Discrete stochastic dynamic programming

    Martin L. Puterman.Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014

  20. [20]

    Finite-time analysis of asynchronous stochastic approximation and Q-learning

    Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on learning theory, pages 3185–3205, 2020. 19

  21. [21]

    A note on the joint spectral radius.Indag

    Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius.Indag. Math, 22(4): 379–381, 1960

  22. [22]

    Stability criteria for switched and hybrid systems.SIAM Review, 49(4):545–592, 2007

    Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulff, and Christopher King. Stability criteria for switched and hybrid systems.SIAM Review, 49(4):545–592, 2007

  23. [23]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement learning: An introduction. MIT Press, 1998

  24. [24]

    The asymptotic convergence-rate of Q-learning

    Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. InAdvances in Neural Information Processing Systems, pages 1064–1070, 1998

  25. [25]

    Issues in using function approximation for reinforcement learning

    Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Michael Mozer, Paul Smolensky, David Touretzky, Jeffrey Elman, and Andreas Weigend, editors,Proceedings of the 1993 Connectionist Models Summer School, pages 255–263. Lawrence Erlbaum, 1993

  26. [26]

    Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3):185–202, 1994

    John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3):185–202, 1994

  27. [27]

    John N Tsitsiklis and Vincent D Blondel. The lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate.Mathematics of Control, Signals and Systems, 10(1):31–40, 1997

  28. [28]

    Double q-learning

    Hado van Hasselt. Double q-learning. InAdvances in Neural Information Processing Systems, volume 23, pages 2613–2622. Curran Associates, Inc., 2010

  29. [29]

    Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

    Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpl∞- bounds for Q-learning.arXiv preprint arXiv:1905.06265, 2019

  30. [30]

    Q-learning.Machine learning, 8(3):279–292, 1992

    Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3):279–292, 1992. 20 Appendix A Auxiliary Lyapunov constructions The following lemma records the product-defined Lyapunov construction used in the finite-time arguments. Lemma 7.Let H:={A 1, A2, . . . , AM } ⊂R m×m, ρ:=ρ(H), and fix anyε >0such that βε :=ρ+ε∈(0,1). For a sequence of s...

  31. [31]

    Statement 2 follows from homogeneity of the Euclidean norm and from the fact thatvt+1 ε is obtained fromvt ε by adding one nonnegative term

    This proves Statement 3. Statement 2 follows from homogeneity of the Euclidean norm and from the fact thatvt+1 ε is obtained fromvt ε by adding one nonnegative term. For Statement 1, fixi∈ { 1, . . . , M}. For everyk≥ 0, each productAσAi, with σ∈ { 1, . . . , M}k, is a product of lengthk+ 1fromH. Therefore β−2 ε vt ε(Aix) = tX k=0 β−2(k+1) ε max σ∈{1,...,...

  32. [32]

    Taking conditional expectations and applying Lemma 8 proves the claim. C.2 Deterministic comparison lemmas For anyQande=Q−Q ∗, the Bellman-max term can be expanded state by state as VQ(s)−V ∗(s) = max a∈A {Q∗(s, a) +e(s, a)} −V ∗(s) = max a∈A {e(s, a)−A ∗(s, a)}. Ifπ∈Θ ∗, thenA ∗(s, π(s)) = 0, and therefore VQ −V ∗ −Π πe≥0. For the upper comparison, ifπ+(...