Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics
Pith reviewed 2026-06-28 16:07 UTC · model grok-4.3
The pith
Under explicit spectral and step-size conditions, periodic and soft target updates guarantee that linear Q-learning converges to the projected Q-Bellman solution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Although linear Q-learning can fail to converge in general, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis uses the exact switched linear system dynamics induced by the Bellman maximum and certifies stability via the joint spectral radius of the resulting switching matrix families.
What carries the argument
Switched linear system dynamics induced by the Bellman maximum, with convergence certified by the joint spectral radius of the switching matrix families.
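The mechanism can be illustrated with a minimal sketch of deterministic linear Q-learning under periodic hard target updates. Everything in it is an assumption made for illustration and is not taken from the paper: the update form θ ← θ + α Φᵀ D (R + γ P Π Φ θ⁻ − Φ θ), the state-major ordering of state-action pairs, and the names `greedy_selector` and `periodic_target_q_learning`. While the target is frozen, the greedy mode is fixed and the inner loop is an affine iteration; the mode can switch only when the target is refreshed.

```python
import numpy as np

def greedy_selector(q, n_states, n_actions):
    """Return the |S| x |S||A| matrix that picks the greedy action in each state.

    Assumes q is ordered state-major: index = s * n_actions + a.
    """
    greedy = q.reshape(n_states, n_actions).argmax(axis=1)
    Pi = np.zeros((n_states, n_states * n_actions))
    Pi[np.arange(n_states), np.arange(n_states) * n_actions + greedy] = 1.0
    return Pi

def periodic_target_q_learning(Phi, D, P, R, gamma, alpha, period, cycles):
    """Deterministic linear Q-learning with a hard target refresh every `period` steps.

    Phi: (|S||A|, m) features, D: (|S||A|, |S||A|) diagonal weights,
    P: (|S||A|, |S|) transition matrix, R: (|S||A|,) expected rewards.
    The update rule is an assumed textbook form, not the paper's exact recursion.
    """
    n_states = P.shape[1]
    n_actions = P.shape[0] // n_states
    theta = np.zeros(Phi.shape[1])
    for _ in range(cycles):
        theta_target = theta.copy()  # hard target update
        # The greedy mode is chosen once per period from the frozen target.
        Pi = greedy_selector(Phi @ theta_target, n_states, n_actions)
        target = R + gamma * P @ Pi @ Phi @ theta_target
        for _ in range(period):
            # Affine inner iteration toward the projected frozen target.
            theta = theta + alpha * Phi.T @ D @ (target - Phi @ theta)
    return theta
```

With tabular features (Φ equal to the identity) and a long enough period, this sketch behaves like approximate value iteration, which gives a simple sanity check before trying non-trivial features.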
If this is right
- The mean recursion of the deterministic system converges when the joint spectral radius condition holds.
- The stochastic reinforcement-learning case follows from the mean convergence plus a separate noise analysis.
- Both periodic hard target updates and soft target updates fall under the same switched-system certificate.
- Convergence is to the exact projected Q-Bellman solution rather than an approximation of it.
Where Pith is reading between the lines
- The same switched-system lens could be applied to analyze other common stabilization tricks such as experience replay buffers.
- Practitioners could compute or bound the joint spectral radius to select safe target-update periods before running experiments.
- The deterministic analysis supplies a clear template for extending the argument to nonlinear function approximation if analogous spectral bounds can be obtained.
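The suggestion about computing or bounding the joint spectral radius can be sketched with a brute-force routine. It relies on a standard fact: for any product length k, the largest spectral radius among all length-k products, raised to the power 1/k, is a lower bound on the JSR, and the largest induced norm, raised to 1/k, is an upper bound. The function name `jsr_bounds` is hypothetical, and the cost grows exponentially with the length, so this is only practical for small families and short products.

```python
import functools
import itertools
import numpy as np

def jsr_bounds(matrices, length):
    """Return (lower, upper) bounds on the JSR from all products of a given length."""
    lower, upper = 0.0, 0.0
    for combo in itertools.product(matrices, repeat=length):
        product = functools.reduce(np.matmul, combo)
        spectral_radius = np.max(np.abs(np.linalg.eigvals(product)))
        # Spectral radius of a length-k product bounds the JSR from below.
        lower = max(lower, spectral_radius ** (1.0 / length))
        # Spectral norm of a length-k product bounds the JSR from above.
        upper = max(upper, np.linalg.norm(product, 2) ** (1.0 / length))
    return lower, upper
```

An upper bound below one certifies stability for the family; a lower bound of one or more rules it out. For instance, the matrices [[0, 1], [0, 0]] and [[0, 0], [1, 0]] each have spectral radius 0, yet their length-2 products already give matching bounds of 1, which shows why individual spectral radii are not enough.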
Load-bearing premise
The target-update mechanism is accurately captured by the exact switched linear system dynamics induced by the Bellman maximum.
What would settle it
A concrete linear Q-learning instance with periodic target updates where the joint spectral radius of the switched matrices is computed to be less than one, yet the iterates are observed to diverge.
Original abstract
Periodic target updates in Q-learning and soft target updates in actor-critic methods are empirically well established stabilization mechanisms, but their precise theoretical explanation is still incomplete. This paper gives a rigorous and exact analysis of these mechanisms for Q-learning with linear function approximation (linear Q-learning) using the exact switched linear system (SLS) dynamics induced by the Bellman maximum and the joint spectral radius (JSR) of the resulting switching matrix families. Although linear Q-learning can fail to converge in general, we prove that, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis is carried out for deterministic linear Q-learning, where the target-update mechanism is most transparent. Once the corresponding JSR certificate is established for the mean recursion, the stochastic reinforcement-learning setting can be treated by replacing deterministic modes with sampled stochastic modes and adding the corresponding stochastic-noise analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that periodic hard target updates and soft target updates stabilize Q-learning with linear function approximation. It models the iteration as the switched linear system induced by the Bellman maximum operator and shows that, under explicit conditions on the joint spectral radius (JSR) of the resulting matrix family and on step sizes, the iterates converge to the exact projected Q-Bellman fixed point. The deterministic case receives the main rigorous treatment via JSR certificates; the stochastic case is only sketched, via the mean recursion plus a noise analysis.
Significance. If the JSR-based certificates are valid, the work supplies the first explicit, non-asymptotic stability conditions explaining why target networks empirically stabilize Q-learning, moving beyond generic contraction arguments to control-theoretic switched-system analysis. The parameter-free nature of the JSR certificate (once the switching family is fixed) and the exact modeling of the Bellman max as finite modes for finite action spaces are particular strengths.
major comments (2)
- [Abstract and §1 (stochastic extension paragraph)] The stochastic extension (mentioned in the abstract and §1) replaces deterministic modes with sampled stochastic modes and adds noise analysis after establishing the JSR certificate on the mean recursion, but no explicit error bounds, concentration inequalities, or statement of the precise sense of convergence (almost-sure, in expectation, or with high probability) are supplied; this leaves the claim that the stochastic setting “can be treated” unverified as a load-bearing step.
- [§3 (switched linear system construction)] The modeling assumption that the Bellman maximum induces an exact finite family of linear modes (and thus a well-defined switching matrix family whose JSR can be computed or bounded) is stated without an explicit construction or enumeration of the modes for a concrete MDP; without this, it is impossible to check whether the claimed JSR < 1 conditions are non-vacuous or merely restate the fixed-point property.
minor comments (2)
- [§4 (soft updates)] Notation for the augmented state in the soft-update case is introduced without a clear diagram or recurrence relating the target network parameters to the online parameters; a small example would clarify the dimension of the switched system.
- [Introduction] The paper cites control-theoretic JSR results but does not compare the obtained stability conditions against existing contraction-mapping or Lyapunov analyses of target networks; a short related-work paragraph would help situate the contribution.
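The recurrence requested in the soft-update comment can be illustrated as follows. This is a sketch in assumed notation, not the paper's: $\theta_k$ are the online parameters, $\theta^-_k$ the target parameters, $\Phi$ the feature matrix with $m$ columns, $D$ a diagonal weighting, $P$ the transition matrix, $R$ the reward vector, $\Pi_\sigma$ the greedy action selector for mode $\sigma$, $\alpha$ the step size, and $\tau$ the mixing rate.

```latex
% Assumed soft-update recursion for a fixed greedy mode \sigma (illustrative notation)
\begin{align*}
\theta_{k+1} &= \theta_k + \alpha\,\Phi^\top D\bigl(R + \gamma P \Pi_\sigma \Phi\,\theta^-_k - \Phi\,\theta_k\bigr),\\
\theta^-_{k+1} &= (1-\tau)\,\theta^-_k + \tau\,\theta_k .
\end{align*}
% Stacking z_k = (\theta_k, \theta^-_k) \in \mathbb{R}^{2m} gives one affine map per mode:
\[
z_{k+1} =
\begin{bmatrix}
I - \alpha\,\Phi^\top D \Phi & \alpha\gamma\,\Phi^\top D P \Pi_\sigma \Phi\\
\tau I & (1-\tau) I
\end{bmatrix} z_k
+
\begin{bmatrix}
\alpha\,\Phi^\top D R\\ 0
\end{bmatrix}.
\]
```

Under these assumptions the augmented switched system has dimension $2m$, with one $2m \times 2m$ matrix per greedy mode.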
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the JSR-based stability analysis and for the constructive major comments. We address each point below and indicate planned revisions.
- Referee: [Abstract and §1 (stochastic extension paragraph)] The stochastic extension (mentioned in the abstract and §1) replaces deterministic modes with sampled stochastic modes and adds noise analysis after establishing the JSR certificate on the mean recursion, but no explicit error bounds, concentration inequalities, or statement of the precise sense of convergence (almost-sure, in expectation, or with high probability) are supplied; this leaves the claim that the stochastic setting “can be treated” unverified as a load-bearing step.
  Authors: We agree that the stochastic extension is only sketched at a high level. The primary rigorous contribution is the deterministic switched-linear-system analysis via JSR. The stochastic paragraph indicates that once the mean recursion is controlled by the JSR certificate, standard stochastic-approximation arguments (e.g., martingale noise terms vanishing under appropriate step-size conditions) can be applied, but no explicit bounds or convergence mode are derived. In revision we will (i) qualify the abstract and §1 to state that the stochastic case is indicated as a direct extension rather than fully developed, and (ii) add a short paragraph outlining the precise sense of convergence in expectation that follows from the mean recursion plus standard noise analysis, without claiming new concentration results.
  Revision: partial
- Referee: [§3 (switched linear system construction)] The modeling assumption that the Bellman maximum induces an exact finite family of linear modes (and thus a well-defined switching matrix family whose JSR can be computed or bounded) is stated without an explicit construction or enumeration of the modes for a concrete MDP; without this, it is impossible to check whether the claimed JSR < 1 conditions are non-vacuous or merely restate the fixed-point property.
  Authors: The finite family arises because, with finite actions and linear features, the Bellman optimality operator (after projection) selects, at each step, one of |A| possible linear maps corresponding to the greedy action; the resulting matrix family is therefore finite. We acknowledge that §3 presents this construction abstractly without a worked numerical example. In the revision we will insert a short illustrative subsection (or appendix) that takes a small finite MDP, explicitly enumerates the |A| matrices in the switching family, computes or bounds their joint spectral radius, and verifies that JSR < 1 is a non-trivial condition that is independent of the fixed-point property itself.
  Revision: yes
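The enumeration promised in this response could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' construction: modes are indexed by deterministic greedy policies (one action per state), so the sketch enumerates up to |A|^|S| matrices, which may differ from the count quoted above. Each matrix is the homogeneous part of one full period of an assumed inner update θ ← θ + α Φᵀ D (target − Φ θ) with a frozen target. The name `period_mode_matrices` is hypothetical, and ΦᵀDΦ is assumed invertible.

```python
import itertools
import numpy as np

def period_mode_matrices(Phi, D, P, gamma, alpha, period):
    """Map each deterministic policy to its assumed one-period mode matrix.

    State-action pairs are ordered state-major: index = s * n_actions + a.
    """
    n_states = P.shape[1]
    n_actions = P.shape[0] // n_states
    m = Phi.shape[1]
    G = Phi.T @ D @ Phi  # assumed invertible (full-rank features, positive weights)
    # Effect of `period` inner steps on the online parameters.
    M_T = np.linalg.matrix_power(np.eye(m) - alpha * G, period)
    modes = {}
    for policy in itertools.product(range(n_actions), repeat=n_states):
        # Selector matrix for this policy: row s picks the pair (s, policy[s]).
        Pi = np.zeros((n_states, n_states * n_actions))
        for s, a in enumerate(policy):
            Pi[s, s * n_actions + a] = 1.0
        # Projected, discounted backup under this greedy mode.
        projected = np.linalg.solve(G, gamma * Phi.T @ D @ P @ Pi @ Phi)
        modes[policy] = M_T + (np.eye(m) - M_T) @ projected
    return modes
```

The returned matrices can then be handed to any JSR bounding routine. In the tabular case with uniform weights, each matrix has infinity norm strictly below one, so the condition is satisfiable there; whether it holds for a given feature matrix is exactly what a worked example would need to show.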
Circularity Check
No significant circularity; derivation uses external JSR theory on standard switched-system model
The paper models target updates as inducing a finite family of linear modes from the Bellman max operator (standard for finite actions), then invokes the joint spectral radius of that family being <1 under explicit step-size/spectral conditions to certify convergence to the projected fixed point. This is a direct application of existing control-theoretic results to the deterministic mean recursion, followed by a separate stochastic-noise argument; no parameter is fitted inside the paper and then renamed as a prediction, no self-citation supplies the uniqueness or convergence certificate, and the modeling assumptions are stated explicitly rather than smuggled. The central claim therefore remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: A joint spectral radius below one for a matrix family implies asymptotic stability of the corresponding switched linear system.
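For reference, the axiom can be written out using the standard definition of the joint spectral radius. The symbols $\mathcal{M}$, $A_i$, $\sigma_k$, and $x_k$ are generic notation introduced here and may not match the paper's.

```latex
% Joint spectral radius of a finite family \mathcal{M} = \{A_1, \dots, A_N\} of square matrices
\[
\rho(\mathcal{M}) \;=\; \lim_{k\to\infty}\; \max_{\sigma_1,\dots,\sigma_k \in \{1,\dots,N\}}
\bigl\| A_{\sigma_k} \cdots A_{\sigma_1} \bigr\|^{1/k}.
\]
% Stability statement used as the axiom:
\[
\rho(\mathcal{M}) < 1
\;\Longrightarrow\;
x_{k+1} = A_{\sigma_k} x_k \to 0
\ \text{for every switching sequence } (\sigma_k) \text{ and every } x_0 .
\]
```

The limit does not depend on the choice of submultiplicative matrix norm, so the certificate is a property of the family alone.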
Forward citations
Cited by 1 Pith paper
- Geometrically Averaged Hard Target Updates for Linear Q-Learning: Introduces and analyzes the λ-target update for linear Q-learning via geometric averaging of periodic target maps, studied with a switching-system model in the deterministic case.