
arxiv: 2606.02645 · v1 · pith:O74PBKHInew · submitted 2026-05-31 · 📊 stat.ML · cs.AI · cs.LG

Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics

Pith reviewed 2026-06-28 16:07 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords: linear Q-learning, target updates, convergence analysis, switched linear systems, joint spectral radius, reinforcement learning, function approximation

The pith

Periodic and soft target updates guarantee linear Q-learning convergence to the projected Q-Bellman solution under spectral and step-size conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Q-learning with linear function approximation (linear Q-learning) can fail to converge in general. This paper proves that, under explicit conditions, periodic hard target updates and soft target updates stabilize the iteration and drive it to the exact solution of the projected Q-Bellman equation. The argument models the combined updates as a switched linear system and certifies stability by showing that the joint spectral radius of the switching matrix family is less than one. Explicit spectral and step-size conditions ensure that the radius stays below one. The main analysis is deterministic; once the certificate holds for the mean recursion, the stochastic case is treated by replacing deterministic modes with sampled stochastic modes and adding a noise analysis.

Core claim

Although linear Q-learning can fail to converge in general, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis uses the exact switched linear system dynamics induced by the Bellman maximum and certifies stability via the joint spectral radius of the resulting switching matrix families.
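
For readers unfamiliar with the certificate, the standard definition of the joint spectral radius can be written out as below. The notation (a finite family of n-by-n matrices indexed by modes, and a switching signal sigma) is chosen here for illustration and is not taken from the paper.

```latex
% Switched linear system driven by an arbitrary switching signal \sigma_k
x_{k+1} = A_{\sigma_k} x_k, \qquad A_{\sigma_k} \in \mathcal{M} = \{A_1, \dots, A_m\} \subset \mathbb{R}^{n \times n}.

% Joint spectral radius: worst-case asymptotic growth rate over all mode sequences
\rho(\mathcal{M}) = \lim_{k \to \infty} \; \max_{\sigma_1, \dots, \sigma_k \in \{1, \dots, m\}}
  \left\| A_{\sigma_k} \cdots A_{\sigma_1} \right\|^{1/k}.
```

The limit exists and does not depend on the matrix norm used. For a single matrix it reduces to the ordinary spectral radius.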

What carries the argument

Switched linear system dynamics induced by the Bellman maximum, with convergence certified by the joint spectral radius of the switching matrix families.
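
A brute-force way to bracket the joint spectral radius of a small finite family is sketched below. It uses the standard fact that, for every product length k, the largest spectral radius of a length-k product (taken to the power 1/k) is a lower bound and the largest norm of such a product (to the power 1/k) is an upper bound. The function name `jsr_bounds` is hypothetical, the cost grows as m^k, and nothing here reproduces the paper's own conditions.

```python
import itertools
import numpy as np

def jsr_bounds(mats, k):
    """Bracket the joint spectral radius using all products of length k.

    Returns (lower, upper) with lower <= JSR <= upper.
    Illustrative helper, not taken from the paper.
    """
    n = mats[0].shape[0]
    lower, upper = 0.0, 0.0
    for word in itertools.product(mats, repeat=k):
        prod = np.eye(n)
        for m in word:
            prod = m @ prod  # apply the modes in sequence
        rho = np.max(np.abs(np.linalg.eigvals(prod)))  # spectral radius
        nrm = np.linalg.norm(prod, 2)                  # induced 2-norm
        lower = max(lower, float(rho) ** (1.0 / k))
        upper = max(upper, float(nrm) ** (1.0 / k))
    return lower, upper
```

If the returned upper bound is below one for some k, the family is certified stable; if the lower bound exceeds one, it is certified unstable. Otherwise a longer product length or a tighter method is needed.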

If this is right

  • The deterministic (mean) recursion converges when the joint spectral radius condition holds.
  • The stochastic reinforcement-learning case follows from the mean convergence plus a separate noise analysis.
  • Both periodic hard target updates and soft target updates fall under the same switched-system certificate.
  • Convergence is to the exact projected Q-Bellman solution rather than an approximation of it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same switched-system lens could be applied to analyze other common stabilization tricks such as experience replay buffers.
  • Practitioners could compute or bound the joint spectral radius to select safe target-update periods before running experiments.
  • The deterministic analysis supplies a clear template for extending the argument to nonlinear function approximation if analogous spectral bounds can be obtained.
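
The period-selection idea in the second bullet might look roughly like the sketch below. It assumes a deterministic update of the form θ ← θ + α Φᵀ D (R + γ P Π Φ θ̄ − Φ θ), with the target θ̄ frozen for `period` steps and Π the greedy-policy selector. This update form, the state-action ordering (index s·|A| + a), and the names `policy_matrix`, `period_maps`, and `one_step_bound` are all assumptions made for illustration, not the paper's construction. The code builds the linear part of the map over one period for every deterministic policy and reports the largest induced ∞-norm, which upper-bounds the joint spectral radius of that family.

```python
import itertools
import numpy as np

def policy_matrix(actions, n_actions):
    """Selector Pi (n_states x n_states*n_actions) for one deterministic policy."""
    n_states = len(actions)
    pi = np.zeros((n_states, n_states * n_actions))
    for s, a in enumerate(actions):
        pi[s, s * n_actions + a] = 1.0
    return pi

def period_maps(phi, d, p, gamma, alpha, period, n_actions):
    """Linear part of one hard-target period, one matrix per policy.

    ASSUMPTION: inner update theta <- theta + alpha*Phi^T D (y - Phi theta)
    with y built from the frozen target; the paper's family may differ.
    """
    n_states = p.shape[1]
    g = phi.T @ (d[:, None] * phi)             # G = Phi^T D Phi
    step = np.eye(g.shape[0]) - alpha * g      # I - alpha G
    decay = np.linalg.matrix_power(step, period)
    # alpha * sum_{j < period} (I - alpha G)^j
    accum = alpha * sum(np.linalg.matrix_power(step, j) for j in range(period))
    maps = []
    for actions in itertools.product(range(n_actions), repeat=n_states):
        pi = policy_matrix(actions, n_actions)
        boot = gamma * phi.T @ (d[:, None] * (p @ pi @ phi))  # bootstrap term
        maps.append(decay + accum @ boot)
    return maps

def one_step_bound(maps):
    """Max induced infinity-norm; an upper bound on the JSR of the family."""
    return max(float(np.linalg.norm(m, np.inf)) for m in maps)
```

Sweeping `period` and comparing the resulting bounds gives a rough picture of how the period length changes the family. A bound below one for these linear parts is only suggestive, since the paper's exact switching family and conditions may differ from this sketch.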

Load-bearing premise

The target-update mechanism is accurately captured by the exact switched linear system dynamics induced by the Bellman maximum.

What would settle it

A concrete linear Q-learning instance with periodic target updates where the joint spectral radius of the switched matrices is computed to be less than one, yet the iterates are observed to diverge.
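
A numerical check of this kind could start from a simulation like the one sketched below. It runs a deterministic linear Q-learning iteration with periodic hard target updates under an assumed update rule (the same form as a gradient step on the weighted squared error against a frozen Bellman target). The rule, the state-action ordering (index s·|A| + a), and the names `greedy_matrix` and `periodic_target_q` are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np

def greedy_matrix(q, n_states, n_actions):
    """Selector Pi (n_states x n_states*n_actions) for the greedy policy of q."""
    pi = np.zeros((n_states, n_states * n_actions))
    best = q.reshape(n_states, n_actions).argmax(axis=1)
    for s, a in enumerate(best):
        pi[s, s * n_actions + a] = 1.0
    return pi

def periodic_target_q(phi, d, p, r, gamma, alpha, period, n_periods):
    """Deterministic linear Q-learning with hard target updates.

    ASSUMPTION: phi is (|S||A| x n), d holds the diagonal of D,
    p is (|S||A| x |S|), r is (|S||A|,). Not the paper's exact scheme.
    """
    n_states = p.shape[1]
    n_actions = p.shape[0] // n_states
    theta = np.zeros(phi.shape[1])
    for _ in range(n_periods):
        target = theta.copy()                       # hard target update
        pi = greedy_matrix(phi @ target, n_states, n_actions)
        y = r + gamma * p @ pi @ phi @ target       # frozen Bellman target
        for _ in range(period):
            theta = theta + alpha * phi.T @ (d * (y - phi @ theta))
    return theta
```

With tabular features (phi equal to the identity) and uniform weights, each period is a convex combination of the current iterate and the Bellman optimality operator, so the iterates should approach the optimal Q-function. The interesting experiments are those with non-trivial features, where the norm of the returned parameters can be compared against a computed joint spectral radius bound.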

Figures

Figures reproduced from arXiv: 2606.02645 by Donghwan Lee.

Figure 1: Hard-target period length as an interpolation parameter. The period …

Original abstract

Periodic target updates in Q-learning and soft target updates in actor-critic methods are empirically well established stabilization mechanisms, but their precise theoretical explanation is still incomplete. This paper gives a rigorous and exact analysis of these mechanisms for Q-learning with linear function approximation (linear Q-learning) using the exact switched linear system (SLS) dynamics induced by the Bellman maximum and the joint spectral radius (JSR) of the resulting switching matrix families. Although linear Q-learning can fail to converge in general, we prove that, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis is carried out for deterministic linear Q-learning, where the target-update mechanism is most transparent. Once the corresponding JSR certificate is established for the mean recursion, the stochastic reinforcement-learning setting can be treated by replacing deterministic modes with sampled stochastic modes and adding the corresponding stochastic-noise analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that periodic hard target updates and soft target updates stabilize linear Q-learning with function approximation by inducing switched linear system dynamics from the Bellman maximum operator; under explicit conditions on the joint spectral radius (JSR) of the resulting matrix family and on step sizes, the system converges to the exact projected Q-Bellman fixed point. The deterministic case receives the main rigorous treatment via JSR certificates, with a sketch for the stochastic case via mean recursion plus noise analysis.

Significance. If the JSR-based certificates are valid, the work supplies the first explicit, non-asymptotic stability conditions explaining why target networks empirically stabilize Q-learning, moving beyond generic contraction arguments to control-theoretic switched-system analysis. The parameter-free nature of the JSR certificate (once the switching family is fixed) and the exact modeling of the Bellman max as finite modes for finite action spaces are particular strengths.

major comments (2)
  1. [Abstract and §1 (stochastic extension paragraph)] The stochastic extension (mentioned in the abstract and §1) replaces deterministic modes with sampled stochastic modes and adds noise analysis after establishing the JSR certificate on the mean recursion, but no explicit error bounds, concentration inequalities, or statement of the precise sense of convergence (almost-sure, in expectation, or with high probability) are supplied; this leaves the claim that the stochastic setting “can be treated” unverified as a load-bearing step.
  2. [§3 (switched linear system construction)] The modeling assumption that the Bellman maximum induces an exact finite family of linear modes (and thus a well-defined switching matrix family whose JSR can be computed or bounded) is stated without an explicit construction or enumeration of the modes for a concrete MDP; without this, it is impossible to check whether the claimed JSR < 1 conditions are non-vacuous or merely restate the fixed-point property.
minor comments (2)
  1. [§4 (soft updates)] Notation for the augmented state in the soft-update case is introduced without a clear diagram or recurrence relating the target network parameters to the online parameters; a small example would clarify the dimension of the switched system.
  2. [Introduction] The paper cites control-theoretic JSR results but does not compare the obtained stability conditions against existing contraction-mapping or Lyapunov analyses of target networks; a short related-work paragraph would help situate the contribution.
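
To make the dimension question in minor comment 1 concrete, one common form of soft target update can be written as a recurrence on a stacked state. The update rule below (a Polyak-style average with rate tau, as used in DDPG-type methods) and all symbols are assumptions for illustration; the paper's own notation and ordering of updates may differ.

```latex
% ASSUMED soft-update scheme: online parameter \theta_k, target parameter \bar\theta_k, both in \mathbb{R}^n
\theta_{k+1} = \theta_k + \alpha \Phi^\top D \left( R + \gamma P \Pi_{\sigma_k} \Phi \bar\theta_k - \Phi \theta_k \right),
\qquad
\bar\theta_{k+1} = (1 - \tau)\, \bar\theta_k + \tau\, \theta_k .

% Stacked state z_k = (\theta_k, \bar\theta_k) \in \mathbb{R}^{2n}; \sigma_k is the greedy mode at \bar\theta_k
z_{k+1} =
\begin{bmatrix}
I - \alpha \Phi^\top D \Phi & \alpha \gamma \Phi^\top D P \Pi_{\sigma_k} \Phi \\
\tau I & (1 - \tau) I
\end{bmatrix}
z_k +
\begin{bmatrix}
\alpha \Phi^\top D R \\ 0
\end{bmatrix}.
```

Under this assumed scheme the switched system lives in dimension 2n, with one block matrix per greedy mode.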

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the JSR-based stability analysis and for the constructive major comments. We address each point below and indicate planned revisions.

  1. Referee: [Abstract and §1 (stochastic extension paragraph)] The stochastic extension (mentioned in the abstract and §1) replaces deterministic modes with sampled stochastic modes and adds noise analysis after establishing the JSR certificate on the mean recursion, but no explicit error bounds, concentration inequalities, or statement of the precise sense of convergence (almost-sure, in expectation, or with high probability) are supplied; this leaves the claim that the stochastic setting “can be treated” unverified as a load-bearing step.

    Authors: We agree that the stochastic extension is only sketched at a high level. The primary rigorous contribution is the deterministic switched-linear-system analysis via JSR. The stochastic paragraph indicates that once the mean recursion is controlled by the JSR certificate, standard stochastic-approximation arguments (e.g., martingale noise terms vanishing under appropriate step-size conditions) can be applied, but no explicit bounds or convergence mode are derived. In revision we will (i) qualify the abstract and §1 to state that the stochastic case is indicated as a direct extension rather than fully developed, and (ii) add a short paragraph outlining the precise sense of convergence in expectation that follows from the mean recursion plus standard noise analysis, without claiming new concentration results. revision: partial

  2. Referee: [§3 (switched linear system construction)] The modeling assumption that the Bellman maximum induces an exact finite family of linear modes (and thus a well-defined switching matrix family whose JSR can be computed or bounded) is stated without an explicit construction or enumeration of the modes for a concrete MDP; without this, it is impossible to check whether the claimed JSR < 1 conditions are non-vacuous or merely restate the fixed-point property.

    Authors: The finite family arises because, with finite actions and linear features, the Bellman optimality operator (after projection) selects, at each step, one of |A| possible linear maps corresponding to the greedy action; the resulting matrix family is therefore finite. We acknowledge that §3 presents this construction abstractly without a worked numerical example. In the revision we will insert a short illustrative subsection (or appendix) that takes a small finite MDP, explicitly enumerates the |A| matrices in the switching family, computes or bounds their joint spectral radius, and verifies that JSR < 1 is a non-trivial condition that is independent of the fixed-point property itself. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses external JSR theory on standard switched-system model


The paper models target updates as inducing a finite family of linear modes from the Bellman max operator (standard for finite actions), then invokes the joint spectral radius of that family being <1 under explicit step-size/spectral conditions to certify convergence to the projected fixed point. This is a direct application of existing control-theoretic results to the deterministic mean recursion, followed by a separate stochastic-noise argument; no parameter is fitted inside the paper and then renamed as a prediction, no self-citation supplies the uniqueness or convergence certificate, and the modeling assumptions are stated explicitly rather than smuggled. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard mathematical properties of the joint spectral radius for switched linear systems and on the modeling assumption that Q-learning target updates induce a switched linear system whose stability is governed by that radius.

axioms (1)
  • standard math: A joint spectral radius less than one for a matrix family implies asymptotic stability of the corresponding switched linear system.
    Standard result from switched systems theory invoked to certify convergence.
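
Spelled out, the standard result reads as follows for a bounded matrix family. The symbols C and r are introduced here for illustration only.

```latex
% If \rho(\mathcal{M}) < 1, then for every r with \rho(\mathcal{M}) < r < 1 there is a constant C \ge 1 such that
\left\| A_{\sigma_k} \cdots A_{\sigma_1} \right\| \le C\, r^{k}
\quad \text{for all } k \ge 1 \text{ and all mode sequences } (\sigma_1, \dots, \sigma_k),

% so every trajectory of x_{k+1} = A_{\sigma_k} x_k decays geometrically, uniformly in the switching signal:
\| x_k \| \le C\, r^{k}\, \| x_0 \| .
```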

pith-pipeline@v0.9.1-grok · 5690 in / 1213 out tokens · 23292 ms · 2026-06-28T16:07:42.101146+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geometrically Averaged Hard Target Updates for Linear Q-Learning

    cs.LG 2026-06 unverdicted novelty 6.0

    Introduces and analyzes the λ-target update for linear Q-learning via geometric averaging of periodic target maps, studied with a switching-system model in the deterministic case.
