Heavy-Ball Q-Learning with Residual Weighting Correction

Donghwan Lee

arxiv: 2606.27112 · v1 · pith:EDRZINI5new · submitted 2026-06-25 · 💻 cs.LG · cs.AI

Heavy-Ball Q-Learning with Residual Weighting Correction

Donghwan Lee This is my paper

Pith reviewed 2026-06-26 05:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords heavy-ball Q-learningreinforcement learningconvergence analysisswitched linear systemsjoint spectral radiuslinear function approximationmomentum methods

0 comments

The pith

A corrected heavy-ball Q-learning method converges and accelerates compared to standard Q-learning under joint spectral radius conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a residual weighting correction for heavy-ball Q-learning and proves its convergence using a switched linear system representation of the algorithm updates. It identifies conditions based on the joint spectral radius of the switching family under which the corrected method converges faster than standard Q-learning. The same construction and analysis extend to the case of Q-learning with linear function approximation, yielding parallel convergence and acceleration guarantees. This switched linear system viewpoint supplies a new framework for examining how momentum affects Q-learning dynamics.

Core claim

The corrected heavy-ball Q-learning method is theoretically guaranteed to converge and converges faster than standard Q-learning under identified conditions; analogous convergence and acceleration statements hold for the linear function approximation case. The analysis rests on representing the algorithms as switched linear systems and bounding their joint spectral radius.

What carries the argument

Switched linear system representation of the Q-learning updates together with joint spectral radius analysis of the associated switching family.

If this is right

The corrected method converges to the optimal Q-function whenever the joint spectral radius is less than one.
Faster convergence than standard Q-learning occurs precisely when the joint spectral radius of the corrected family is smaller than that of the standard family.
The same convergence and acceleration guarantees apply when linear function approximation is introduced.
The switched linear system model supplies an alternative analytical route for studying momentum corrections in Q-learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The switched linear system lens could be used to derive corrections for other momentum variants such as Nesterov acceleration.
Parameter choices that minimize the joint spectral radius may directly improve practical convergence speed.
The framework might extend to analyzing momentum in related algorithms such as SARSA or actor-critic methods.

Load-bearing premise

The switched linear system representation accurately captures the dynamics of the heavy-ball Q-learning updates, allowing the joint spectral radius analysis to determine convergence and acceleration.

What would settle it

A concrete computation or simulation in which the joint spectral radius of the corrected method exceeds one yet the iterates still converge, or in which the radius is strictly smaller than that of standard Q-learning yet no speedup occurs.

Figures

Figures reproduced from arXiv: 2606.27112 by Donghwan Lee.

**Figure 2.** Figure 2: The plotted quantity is the median relative [PITH_FULL_IMAGE:figures/full_fig_p023_2.png] view at source ↗

**Figure 3.** Figure 3: The plot contains exactly the three method curves i [PITH_FULL_IMAGE:figures/full_fig_p030_3.png] view at source ↗

**Figure 4.** Figure 4: Sampled stochastic version of the example in [PITH_FULL_IMAGE:figures/full_fig_p033_4.png] view at source ↗

read the original abstract

This paper proposes a corrected heavy-ball Q-learning method for reinforcement learning (RL) and establishes its convergence. It also identifies conditions under which the method is theoretically guaranteed to converge faster than standard Q-learning. The same construction is then extended to Q-learning with linear function approximation, where analogous convergence and acceleration statements are derived. The analysis is based on a switched linear system (SLS) representation of Q-learning algorithms and on the joint spectral radius (JSR) of the associated switching families. This SLS viewpoint is not commonly used in standard analyses of Q-learning, and it provides a complementary framework and new insight into how heavy-ball momentum can accelerate Q-learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a residual-corrected heavy-ball Q-learning update plus SLS-JSR convergence analysis for both tabular and linear FA cases, but the stochastic sampling to arbitrary switching step needs checking.

read the letter

The main takeaway is a new heavy-ball Q-learning variant that adds a residual weighting correction, with claimed convergence and faster rates than plain Q-learning under some conditions. The same construction is carried over to linear function approximation with matching statements. The analysis route is a switched linear system representation whose stability is read off the joint spectral radius of the switching family.

The SLS-JSR angle is the clearest novelty here. Standard Q-learning proofs usually go through contraction mappings or stochastic approximation; this one imports tools from switched systems, which is uncommon in the RL literature and gives a different way to see where momentum helps.

The algorithm construction looks deliberate. The residual term is introduced specifically to make the heavy-ball update fit the SLS form while preserving the desired acceleration properties.

The soft spot is the modeling step that turns the stochastic recursion into an SLS. The updates depend on which state-action pair is sampled at each step, and that sequence is generated by the behavior policy and the current estimates rather than chosen arbitrarily. The JSR bound therefore controls worst-case deterministic trajectories. Without an additional argument that carries the bound over to mean-square or almost-sure convergence under the actual sampling distribution, the guarantees rest on an unclosed gap. The abstract does not spell out how that gap is handled.

This is for readers who follow accelerated methods or non-standard proof techniques in RL theory. The formal effort and the new viewpoint are enough to justify sending it to referees, even if the stochastic-to-deterministic link requires tightening.

Referee Report

2 major / 1 minor

Summary. The paper proposes a corrected heavy-ball Q-learning method incorporating residual weighting correction. It claims to establish convergence (and identify conditions for faster convergence than standard Q-learning) via a switched linear system (SLS) representation of the updates together with joint spectral radius (JSR) analysis of the associated switching families. The same SLS+JSR construction is extended to Q-learning with linear function approximation, yielding analogous convergence and acceleration statements. The SLS viewpoint is presented as a complementary, non-standard framework for analyzing momentum in Q-learning.

Significance. If the central claims hold, the work supplies a novel analytical tool (SLS representation + JSR) for studying acceleration in tabular and linear-approximation Q-learning; this is a genuine departure from the usual contraction-mapping or supermartingale arguments and could yield new insight into momentum effects. The explicit identification of acceleration conditions and the extension to function approximation are also positive features.

major comments (2)

[Abstract / SLS representation] Abstract and the SLS-representation section: the convergence claim rests on rewriting the corrected heavy-ball update as an SLS whose stability is settled by JSR < 1. This modeling step implicitly treats the sequence of visited state-action pairs as an arbitrary switching signal. In the actual algorithm the switching is generated by the Markov chain induced by the behavior policy and the current Q-estimate; the resulting JSR therefore only upper-bounds the worst-case deterministic trajectory. No explicit contraction mapping, supermartingale, or mean-square analysis is supplied to close the gap between the deterministic SLS and the stochastic recursion, which is load-bearing for the central convergence guarantee.
[Linear function approximation section] The same modeling gap appears in the linear function approximation extension: the switched-linear-system representation must still be shown to imply almost-sure or mean-square convergence under the stochastic sampling induced by the behavior policy, rather than only deterministic worst-case stability.

minor comments (1)

Notation for the residual weighting correction term and the precise definition of the switching family should be introduced earlier and used consistently throughout the proofs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address the two major comments point by point below.

read point-by-point responses

Referee: [Abstract / SLS representation] Abstract and the SLS-representation section: the convergence claim rests on rewriting the corrected heavy-ball update as an SLS whose stability is settled by JSR < 1. This modeling step implicitly treats the sequence of visited state-action pairs as an arbitrary switching signal. In the actual algorithm the switching is generated by the Markov chain induced by the behavior policy and the current Q-estimate; the resulting JSR therefore only upper-bounds the worst-case deterministic trajectory. No explicit contraction mapping, supermartingale, or mean-square analysis is supplied to close the gap between the deterministic SLS and the stochastic recursion, which is load-bearing for the central convergence guarantee.

Authors: The joint spectral radius condition JSR < 1 establishes uniform asymptotic stability of the switched linear system for every possible switching sequence. The sequence of state-action pairs generated by any behavior policy (Markovian or otherwise) is simply one particular switching sequence; therefore the stability result applies directly to the realized trajectories of the algorithm. The residual weighting correction is constructed precisely so that the update recursion takes the exact SLS form. We will add a short clarifying paragraph in the SLS-representation section that explicitly connects the arbitrary-switching stability guarantee to the Markovian case and notes the implications for the underlying stochastic process. revision: yes
Referee: [Linear function approximation section] The same modeling gap appears in the linear function approximation extension: the switched-linear-system representation must still be shown to imply almost-sure or mean-square convergence under the stochastic sampling induced by the behavior policy, rather than only deterministic worst-case stability.

Authors: The identical JSR argument applies verbatim to the linear-function-approximation setting, because the switched-linear representation and the associated switching family are constructed analogously. We will insert a parallel clarifying sentence in that section. revision: yes

Circularity Check

0 steps flagged

No circularity; external SLS/JSR framework used

full rationale

The paper's convergence claims rest on rewriting the heavy-ball update as a switched linear system whose stability is analyzed via joint spectral radius. This modeling step and the JSR tool are presented as an external complementary framework drawn from control theory, not derived from or equivalent to the paper's own fitted parameters, self-citations, or target results by construction. No self-definitional loops, renamed empirical patterns, or load-bearing self-citation chains appear in the provided derivation outline. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper relies on standard mathematical background for convergence analysis in iterative methods and the validity of representing Q-learning as a switched linear system; no free parameters, ad-hoc axioms, or invented entities are mentioned.

pith-pipeline@v0.9.1-grok · 5624 in / 1087 out tokens · 22918 ms · 2026-06-26T05:15:12.428809+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 20 canonical work pages · 3 internal anchors

[1]

Christopher J. C. H. Watkins and Peter Dayan. Q-learning . Machine Learning, 8(3–4):279– 292, 1992. doi:10.1007/BF00992698

work page doi:10.1007/bf00992698 1992
[2]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. Second edition, MIT Press, Cambridge, MA, 2018

2018
[3]

Kappen, Mohammad G

Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Az ar, and Rémi Munos. Speedy q-learning. In Advances in Neural Information Processing Systems 24 , pages 2411– 2419, 2011

2011
[4]

Pid a ccelerated value itera- tion algorithm

Amir-Massoud Farahmand and Mohammad Ghavamzadeh. Pid a ccelerated value itera- tion algorithm. In Proceedings of the 38th International Conference on Machin e Learning, Proceedings of Machine Learning Research, Vol. 139, pages 3 143–3153, PMLR, 2021

2021
[5]

A ﬁrst-order app roach to accelerated value it- eration

Vineet Goyal and Julien Grand-Clément. A ﬁrst-order app roach to accelerated value it- eration. Operations Research, 71(2):517–535, 2023. Published online March 24, 2022; doi:10.1287/opre.2022.2269

work page doi:10.1287/opre.2022.2269 2023
[6]

Accelerating procedure s of the value iteration algorithm for discounted markov decision processes, based on a one-step l ookahead analysis

Meir Herzberg and Uri Yechiali. Accelerating procedure s of the value iteration algorithm for discounted markov decision processes, based on a one-step l ookahead analysis. Operations Research, 42(5):940–946, 1994. doi:10.1287/opre.42.5.940

work page doi:10.1287/opre.42.5.940 1994
[7]

Acceleration op- erators in the value iteration algorithms for markov decisi on processes

Oleksandr Shlakhter, Chi-Guhn Lee, Dmitry Khmelev, and Nasser Jaber. Acceleration op- erators in the value iteration algorithms for markov decisi on processes. Operations Research, 58(1):193–202, 2010. doi:10.1287/opre.1090.0705

work page doi:10.1287/opre.1090.0705 2010
[8]

Bertsekas

Dimitri P. Bertsekas. Generic rank-one corrections for the value iteration in markovian decision problems. Operations Research Letters , 17(3):111–119, 1995. doi:10.1016/0167- 6377(95)00007-7

work page doi:10.1016/0167- 1995
[9]

Rank-one modiﬁed value iteration

Arman Shariﬁ Kolarijani, Tolga Ok, Peyman Mohajerin Esf ahani, and Mohamad Amin Shariﬁ Kolarijani. Rank-one modiﬁed value iteration. In Proceedings of the 42nd Interna- tional Conference on Machine Learning , Proceedings of Machine Learning Research, Vol. 267, pages 31182–31201, PMLR, 2025

2025
[10]

Ryu, and Amir-Mass oud Farahmand

Jongmin Lee, Amin Rakhsha, Ernest K. Ryu, and Amir-Mass oud Farahmand. Deﬂated dynamics value iteration. Transactions on Machine Learning Research , 2025

2025
[11]

Switching-Geometry Analysis of Deflated Q-Value Iteration

Donghwan Lee. Switching-geometry analysis of deﬂated q-value iteration . arXiv:2605.10811v2, 2026. doi:10.48550/arXiv.2605.108 11

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.108 2026
[12]

Spectral analysis of heavy-ball q-value iteration

Donghwan Lee. Spectral analysis of heavy-ball q-value iteration . Manuscript, 2026

2026
[13]

Analysis of q-learning with adaptation and momentum restart for gradient descent

Bowen Weng, Huaqing Xiong, Yingbin Liang, and Wei Zhang . Analysis of q-learning with adaptation and momentum restart for gradient descent. In Proceedings of the Twenty- Ninth International Joint Conference on Artiﬁcial Intelli gence, pages 3051–3057, 2020. doi:10.24963/ijcai.2020/422

work page doi:10.24963/ijcai.2020/422 2020
[14]

Finite-time theory for momentum q-learning

Bowen Weng, Huaqing Xiong, Lin Zhao, Yingbin Liang, and Wei Zhang. Finite-time theory for momentum q-learning. In Proceedings of the Thirty-Seventh Conference on Uncertain ty in Artiﬁcial Intelligence , Proceedings of Machine Learning Research, Vol. 161, pages 665– 674, PMLR, 2021

2021
[15]

Anderson accelerat ion for reinforcement learning

Matthieu Geist and Bruno Scherrer. Anderson accelerat ion for reinforcement learning. In Proceedings of the 14th European Workshop on Reinforcement Learning (EWRL 2018), Lille, France, 2018. 34

2018
[16]

Momentum in re- inforcement learning

Nino Vieillard, Bruno Scherrer, Olivier Pietquin, and Matthieu Geist. Momentum in re- inforcement learning. In Proceedings of the Twenty Third International Conference o n Artiﬁcial Intelligence and Statistics , Proceedings of Machine Learning Research, Vol. 108, pages 2529–2538, PMLR, 2020

2020
[17]

Jongmin Lee and Ernest K. Ryu. Accelerating value itera tion with anchoring. In Advances in Neural Information Processing Systems 36 , pages 53924–53963, 2023. doi:10.52202/075280- 2346

work page doi:10.52202/075280- 2023
[18]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andr ei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fi djeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antono glou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Hum an-level control through deep reinforceme...

work page doi:10.1038/nature14236 2015
[19]

Carl D. Meyer. Matrix analysis and applied linear algebra . Society for Industrial and Applied Mathematics, Philadelphia, 2000. ISBN 978-0-89871-454-8 ; doi:10.1137/1.9780898719512

work page doi:10.1137/1.9780898719512 2000
[20]

Raphaël M. Jungers. The joint spectral radius: theory and applications . Lecture Notes in Control and Information Sciences, Vol. 385, Spring er, Berlin, Heidelberg, 2009. doi:10.1007/978-3-540-95980-9

work page doi:10.1007/978-3-540-95980-9 2009
[21]

Eliahu I. Jury. Theory and application of the z-transform method . John Wiley & Sons, New York, 1964

1964
[22]

Lyapunov-Certified Direct Switching Theory for Q-Learning

Donghwan Lee. Lyapunov-certiﬁed direct switching theory for q-learning . arXiv:2604.19569v3, 2026. doi:10.48550/arXiv.2604.195 69

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.195 2026
[23]

Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

Donghwan Lee. Beyond the bellman ﬁxed point: geometry and fast policy iden tiﬁcation in value iteration. arXiv:2604.17457v4, 2026. doi:10.48550/arXiv.2604.17 457

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.17 2026
[24]

Maguluri, Sanjay Shakkottai, and K arthikeyan Shanmugam

Zaiwei Chen, Siva T. Maguluri, Sanjay Shakkottai, and K arthikeyan Shanmugam. A lyapunov theory for ﬁnite-sample guarantees of markovian stochastic approxima- tion. Operations Research, 72(4):1352–1367, 2024. Published online October 6, 2023; doi:10.1287/opre.2022.0249

work page doi:10.1287/opre.2022.0249 2024
[25]

Keckler, and Simha Sethumadhavan

Daniel Liberzon. Switching in systems and control . Systems & Control: Foundations & Applications, Birkhäuser Boston, 2003. doi:10.1007/978- 1-4612-0017-8

work page doi:10.1007/978- 2003
[26]

Antsaklis

Hai Lin and Panos J. Antsaklis. Stability and stabiliza bility of switched linear systems: a survey of recent results. IEEE Transactions on Automatic Control , 54(2):308–322, 2009. doi:10.1109/TAC.2008.2012009

work page doi:10.1109/tac.2008.2012009 2009
[27]

Sta- bility criteria for switched and hybrid systems

Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulﬀ, a nd Christopher King. Sta- bility criteria for switched and hybrid systems. SIAM Review , 49(4):545–592, 2007. doi:10.1137/05063516X

work page doi:10.1137/05063516x 2007
[28]

A note on the joint s pectral radius

Gian-Carlo Rota and Gilbert Strang. A note on the joint s pectral radius. Indagationes Mathematicae, 22:379–381, 1960. doi:10.1016/S1385-7258(60)50046-1

work page doi:10.1016/s1385-7258(60)50046-1 1960
[29]

Generating fu nctions of switched linear systems: analysis, computation, and stability applications

Jianghai Hu, Jinglai Shen, and Wei Zhang. Generating fu nctions of switched linear systems: analysis, computation, and stability applications. IEEE Transactions on Automatic Control, 56(5):1059–1074, 2011. doi:10.1109/TAC.2010.2067590

work page doi:10.1109/tac.2010.2067590 2011
[30]

Puterman

Martin L. Puterman. Markov decision processes: discrete stochastic dynamic pr ogram- ming. Wiley Series in Probability and Statistics, John Wiley & So ns, New York, 1994. doi:10.1002/9780470316887. 35

work page doi:10.1002/9780470316887 1994
[31]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. Athena Scientiﬁc, Belmont, MA, 1996

1996
[32]

Residual algorithms: reinforcement lea rning with function approximation

Leemon Baird. Residual algorithms: reinforcement lea rning with function approximation. In Proceedings of the Twelfth International Conference on Mac hine Learning, pages 30–37. Morgan Kaufmann, 1995. 36 A Auxiliary results A.1 Nonnegative matrix row-sum bounds The following standard row-sum bound is the form of the Colla tz–Wielandt inequality used below...

1995

[1] [1]

Christopher J. C. H. Watkins and Peter Dayan. Q-learning . Machine Learning, 8(3–4):279– 292, 1992. doi:10.1007/BF00992698

work page doi:10.1007/bf00992698 1992

[2] [2]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. Second edition, MIT Press, Cambridge, MA, 2018

2018

[3] [3]

Kappen, Mohammad G

Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Az ar, and Rémi Munos. Speedy q-learning. In Advances in Neural Information Processing Systems 24 , pages 2411– 2419, 2011

2011

[4] [4]

Pid a ccelerated value itera- tion algorithm

Amir-Massoud Farahmand and Mohammad Ghavamzadeh. Pid a ccelerated value itera- tion algorithm. In Proceedings of the 38th International Conference on Machin e Learning, Proceedings of Machine Learning Research, Vol. 139, pages 3 143–3153, PMLR, 2021

2021

[5] [5]

A ﬁrst-order app roach to accelerated value it- eration

Vineet Goyal and Julien Grand-Clément. A ﬁrst-order app roach to accelerated value it- eration. Operations Research, 71(2):517–535, 2023. Published online March 24, 2022; doi:10.1287/opre.2022.2269

work page doi:10.1287/opre.2022.2269 2023

[6] [6]

Accelerating procedure s of the value iteration algorithm for discounted markov decision processes, based on a one-step l ookahead analysis

Meir Herzberg and Uri Yechiali. Accelerating procedure s of the value iteration algorithm for discounted markov decision processes, based on a one-step l ookahead analysis. Operations Research, 42(5):940–946, 1994. doi:10.1287/opre.42.5.940

work page doi:10.1287/opre.42.5.940 1994

[7] [7]

Acceleration op- erators in the value iteration algorithms for markov decisi on processes

Oleksandr Shlakhter, Chi-Guhn Lee, Dmitry Khmelev, and Nasser Jaber. Acceleration op- erators in the value iteration algorithms for markov decisi on processes. Operations Research, 58(1):193–202, 2010. doi:10.1287/opre.1090.0705

work page doi:10.1287/opre.1090.0705 2010

[8] [8]

Bertsekas

Dimitri P. Bertsekas. Generic rank-one corrections for the value iteration in markovian decision problems. Operations Research Letters , 17(3):111–119, 1995. doi:10.1016/0167- 6377(95)00007-7

work page doi:10.1016/0167- 1995

[9] [9]

Rank-one modiﬁed value iteration

Arman Shariﬁ Kolarijani, Tolga Ok, Peyman Mohajerin Esf ahani, and Mohamad Amin Shariﬁ Kolarijani. Rank-one modiﬁed value iteration. In Proceedings of the 42nd Interna- tional Conference on Machine Learning , Proceedings of Machine Learning Research, Vol. 267, pages 31182–31201, PMLR, 2025

2025

[10] [10]

Ryu, and Amir-Mass oud Farahmand

Jongmin Lee, Amin Rakhsha, Ernest K. Ryu, and Amir-Mass oud Farahmand. Deﬂated dynamics value iteration. Transactions on Machine Learning Research , 2025

2025

[11] [11]

Switching-Geometry Analysis of Deflated Q-Value Iteration

Donghwan Lee. Switching-geometry analysis of deﬂated q-value iteration . arXiv:2605.10811v2, 2026. doi:10.48550/arXiv.2605.108 11

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.108 2026

[12] [12]

Spectral analysis of heavy-ball q-value iteration

Donghwan Lee. Spectral analysis of heavy-ball q-value iteration . Manuscript, 2026

2026

[13] [13]

Analysis of q-learning with adaptation and momentum restart for gradient descent

Bowen Weng, Huaqing Xiong, Yingbin Liang, and Wei Zhang . Analysis of q-learning with adaptation and momentum restart for gradient descent. In Proceedings of the Twenty- Ninth International Joint Conference on Artiﬁcial Intelli gence, pages 3051–3057, 2020. doi:10.24963/ijcai.2020/422

work page doi:10.24963/ijcai.2020/422 2020

[14] [14]

Finite-time theory for momentum q-learning

Bowen Weng, Huaqing Xiong, Lin Zhao, Yingbin Liang, and Wei Zhang. Finite-time theory for momentum q-learning. In Proceedings of the Thirty-Seventh Conference on Uncertain ty in Artiﬁcial Intelligence , Proceedings of Machine Learning Research, Vol. 161, pages 665– 674, PMLR, 2021

2021

[15] [15]

Anderson accelerat ion for reinforcement learning

Matthieu Geist and Bruno Scherrer. Anderson accelerat ion for reinforcement learning. In Proceedings of the 14th European Workshop on Reinforcement Learning (EWRL 2018), Lille, France, 2018. 34

2018

[16] [16]

Momentum in re- inforcement learning

Nino Vieillard, Bruno Scherrer, Olivier Pietquin, and Matthieu Geist. Momentum in re- inforcement learning. In Proceedings of the Twenty Third International Conference o n Artiﬁcial Intelligence and Statistics , Proceedings of Machine Learning Research, Vol. 108, pages 2529–2538, PMLR, 2020

2020

[17] [17]

Jongmin Lee and Ernest K. Ryu. Accelerating value itera tion with anchoring. In Advances in Neural Information Processing Systems 36 , pages 53924–53963, 2023. doi:10.52202/075280- 2346

work page doi:10.52202/075280- 2023

[18] [18]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andr ei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fi djeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antono glou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Hum an-level control through deep reinforceme...

work page doi:10.1038/nature14236 2015

[19] [19]

Carl D. Meyer. Matrix analysis and applied linear algebra . Society for Industrial and Applied Mathematics, Philadelphia, 2000. ISBN 978-0-89871-454-8 ; doi:10.1137/1.9780898719512

work page doi:10.1137/1.9780898719512 2000

[20] [20]

Raphaël M. Jungers. The joint spectral radius: theory and applications . Lecture Notes in Control and Information Sciences, Vol. 385, Spring er, Berlin, Heidelberg, 2009. doi:10.1007/978-3-540-95980-9

work page doi:10.1007/978-3-540-95980-9 2009

[21] [21]

Eliahu I. Jury. Theory and application of the z-transform method . John Wiley & Sons, New York, 1964

1964

[22] [22]

Lyapunov-Certified Direct Switching Theory for Q-Learning

Donghwan Lee. Lyapunov-certiﬁed direct switching theory for q-learning . arXiv:2604.19569v3, 2026. doi:10.48550/arXiv.2604.195 69

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.195 2026

[23] [23]

Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

Donghwan Lee. Beyond the bellman ﬁxed point: geometry and fast policy iden tiﬁcation in value iteration. arXiv:2604.17457v4, 2026. doi:10.48550/arXiv.2604.17 457

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.17 2026

[24] [24]

Maguluri, Sanjay Shakkottai, and K arthikeyan Shanmugam

Zaiwei Chen, Siva T. Maguluri, Sanjay Shakkottai, and K arthikeyan Shanmugam. A lyapunov theory for ﬁnite-sample guarantees of markovian stochastic approxima- tion. Operations Research, 72(4):1352–1367, 2024. Published online October 6, 2023; doi:10.1287/opre.2022.0249

work page doi:10.1287/opre.2022.0249 2024

[25] [25]

Keckler, and Simha Sethumadhavan

Daniel Liberzon. Switching in systems and control . Systems & Control: Foundations & Applications, Birkhäuser Boston, 2003. doi:10.1007/978- 1-4612-0017-8

work page doi:10.1007/978- 2003

[26] [26]

Antsaklis

Hai Lin and Panos J. Antsaklis. Stability and stabiliza bility of switched linear systems: a survey of recent results. IEEE Transactions on Automatic Control , 54(2):308–322, 2009. doi:10.1109/TAC.2008.2012009

work page doi:10.1109/tac.2008.2012009 2009

[27] [27]

Sta- bility criteria for switched and hybrid systems

Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulﬀ, a nd Christopher King. Sta- bility criteria for switched and hybrid systems. SIAM Review , 49(4):545–592, 2007. doi:10.1137/05063516X

work page doi:10.1137/05063516x 2007

[28] [28]

A note on the joint s pectral radius

Gian-Carlo Rota and Gilbert Strang. A note on the joint s pectral radius. Indagationes Mathematicae, 22:379–381, 1960. doi:10.1016/S1385-7258(60)50046-1

work page doi:10.1016/s1385-7258(60)50046-1 1960

[29] [29]

Generating fu nctions of switched linear systems: analysis, computation, and stability applications

Jianghai Hu, Jinglai Shen, and Wei Zhang. Generating fu nctions of switched linear systems: analysis, computation, and stability applications. IEEE Transactions on Automatic Control, 56(5):1059–1074, 2011. doi:10.1109/TAC.2010.2067590

work page doi:10.1109/tac.2010.2067590 2011

[30] [30]

Puterman

Martin L. Puterman. Markov decision processes: discrete stochastic dynamic pr ogram- ming. Wiley Series in Probability and Statistics, John Wiley & So ns, New York, 1994. doi:10.1002/9780470316887. 35

work page doi:10.1002/9780470316887 1994

[31] [31]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. Athena Scientiﬁc, Belmont, MA, 1996

1996

[32] [32]

Residual algorithms: reinforcement lea rning with function approximation

Leemon Baird. Residual algorithms: reinforcement lea rning with function approximation. In Proceedings of the Twelfth International Conference on Mac hine Learning, pages 30–37. Morgan Kaufmann, 1995. 36 A Auxiliary results A.1 Nonnegative matrix row-sum bounds The following standard row-sum bound is the form of the Colla tz–Wielandt inequality used below...

1995