Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics

Donghwan Lee

arXiv: 2606.02645 · v1 · pith:O74PBKHInew · submitted 2026-05-31 · stat.ML · cs.AI · cs.LG

Pith reviewed 2026-06-28 16:07 UTC · model grok-4.3

keywords: linear Q-learning · target updates · convergence analysis · switched linear systems · joint spectral radius · reinforcement learning · function approximation

The pith

Periodic and soft target updates guarantee linear Q-learning convergence to the projected Q-Bellman solution under spectral and step-size conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Q-learning with linear function approximation (linear Q-learning) can fail to converge in general. This paper proves that, under explicit conditions, periodic hard target updates and soft target updates stabilize the process and drive it to the exact solution of the projected Q-Bellman equation. The argument models the combined updates as a switched linear system and certifies stability by showing that the joint spectral radius of the switching matrix family is less than one. Explicit spectral and step-size conditions ensure this radius stays below one. The main analysis is deterministic; the stochastic case is then treated by replacing deterministic modes with sampled stochastic modes and adding a noise analysis on top of the certificate for the mean recursion.

Core claim

Although linear Q-learning can fail to converge in general, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis uses the exact switched linear system dynamics induced by the Bellman maximum and certifies stability via the joint spectral radius of the resulting switching matrix families.

What carries the argument

Switched linear system dynamics induced by the Bellman maximum, with convergence certified by the joint spectral radius of the switching matrix families.
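
For readers less familiar with the terminology, the two objects named above can be written out as follows. This is the standard textbook definition of a switched linear system and of the joint spectral radius, not a formula copied from the paper; the symbols A_i, σ_k, C, and λ are notation chosen here for illustration.

```latex
% Switched linear system over a finite mode set \mathcal{M} = \{A_1, \dots, A_m\}
x_{k+1} = A_{\sigma_k} x_k, \qquad \sigma_k \in \{1, \dots, m\}

% Joint spectral radius (Rota--Strang), for any matrix norm
\rho(\mathcal{M}) = \lim_{k \to \infty} \max_{\sigma_1, \dots, \sigma_k}
  \left\| A_{\sigma_k} \cdots A_{\sigma_1} \right\|^{1/k}

% If \rho(\mathcal{M}) < 1, then for every \lambda \in (\rho(\mathcal{M}), 1)
% there is a constant C > 0 such that, for every switching sequence,
\| x_k \| \le C \, \lambda^{k} \, \| x_0 \|
```

The last line is why a joint spectral radius below one is a stability certificate: the decay holds regardless of which mode the Bellman maximum selects at each step.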

If this is right

The deterministic (mean) recursion converges when the joint spectral radius condition holds.
The stochastic reinforcement-learning case follows from the mean convergence plus a separate noise analysis.
Both periodic hard target updates and soft target updates fall under the same switched-system certificate.
Convergence is to the exact projected Q-Bellman solution rather than an approximation of it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same switched-system lens could be applied to analyze other common stabilization tricks such as experience replay buffers.
Practitioners could compute or bound the joint spectral radius to select safe target-update periods before running experiments.
The deterministic analysis supplies a clear template for extending the argument to nonlinear function approximation if analogous spectral bounds can be obtained.
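
The suggestion that practitioners could bound the joint spectral radius can be sketched with a brute-force upper bound. For any submultiplicative norm and any product length k, the largest norm over all length-k products, raised to the power 1/k, upper-bounds the joint spectral radius, so a value below one at any k is a sufficient certificate. The function names and the two-mode family below are hypothetical and are not taken from the paper; the cost grows as (number of modes)^k, so this is only practical for small families.

```python
from itertools import product


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def inf_norm(a):
    """Induced infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)


def jsr_upper_bound(mats, length):
    """Max over all products of `length` factors of ||product||^(1/length)."""
    worst = 0.0
    for word in product(mats, repeat=length):
        prod = word[0]
        for m in word[1:]:
            prod = matmul(m, prod)  # apply the next mode on the left
        worst = max(worst, inf_norm(prod))
    return worst ** (1.0 / length)


# Hypothetical two-mode family, chosen for illustration only.
EXAMPLE_MODES = [
    [[0.5, 0.4], [0.0, 0.5]],
    [[0.5, 0.0], [0.4, 0.5]],
]

for length in (1, 2, 4):
    print(length, jsr_upper_bound(EXAMPLE_MODES, length))
```

A bound at or above one is inconclusive rather than evidence of instability; longer products or a different norm may still certify the family.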

Load-bearing premise

The target-update mechanism is accurately captured by the exact switched linear system dynamics induced by the Bellman maximum.

What would settle it

A concrete linear Q-learning instance with periodic target updates where the joint spectral radius of the switched matrices is computed to be less than one, yet the iterates are observed to diverge.
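
A small harness for running such an experiment might look like the sketch below. It assumes one common form of the deterministic recursion with a frozen target (the backup is computed from the target weights, then the online weights take `period` gradient-style steps toward it); the paper's exact recursion may differ, and every name here (`run_periodic_target`, `phi`, `d`, `p`) is hypothetical. State-action pairs are indexed as `s * n_actions + a`.

```python
def features_times(phi, theta):
    """Compute the vector Phi @ theta, with Phi given as a list of rows."""
    return [sum(f * t for f, t in zip(row, theta)) for row in phi]


def run_periodic_target(phi, d, r, p, gamma, alpha, period, cycles, n_states, n_actions):
    """Deterministic linear Q-learning with periodic hard target updates (assumed form)."""
    n_sa, n_feat = len(phi), len(phi[0])
    theta = [0.0] * n_feat
    for _ in range(cycles):
        # Hard target update: the backup below uses a frozen copy of theta.
        q_target = features_times(phi, theta)
        v = [max(q_target[s * n_actions + a] for a in range(n_actions))
             for s in range(n_states)]
        y = [r[i] + gamma * sum(p[i][s] * v[s] for s in range(n_states))
             for i in range(n_sa)]
        for _ in range(period):
            q = features_times(phi, theta)
            theta = [
                theta[j] + alpha * sum(phi[i][j] * d[i] * (y[i] - q[i]) for i in range(n_sa))
                for j in range(n_feat)
            ]
        if any(abs(t) > 1e12 for t in theta):
            raise OverflowError("iterates appear to diverge")
    return theta


# Sanity check on a hypothetical tabular instance (identity features),
# where the limit is the optimal Q-function and can be computed by hand.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
theta = run_periodic_target(
    phi=IDENTITY, d=[0.25] * 4, r=[1.0, 0.0, 0.0, 2.0],
    p=[[1.0, 0.0]] * 4, gamma=0.5, alpha=1.0,
    period=20, cycles=60, n_states=2, n_actions=2,
)
print(theta)
```

To probe the falsification scenario, one would swap in a non-tabular feature matrix, compute a bound on the joint spectral radius of the corresponding mode family separately, and check whether the `OverflowError` guard ever fires.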

Figures

Figures reproduced from arXiv: 2606.02645 by Donghwan Lee.

Original abstract

Periodic target updates in Q-learning and soft target updates in actor-critic methods are empirically well established stabilization mechanisms, but their precise theoretical explanation is still incomplete. This paper gives a rigorous and exact analysis of these mechanisms for Q-learning with linear function approximation (linear Q-learning) using the exact switched linear system (SLS) dynamics induced by the Bellman maximum and the joint spectral radius (JSR) of the resulting switching matrix families. Although linear Q-learning can fail to converge in general, we prove that, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis is carried out for deterministic linear Q-learning, where the target-update mechanism is most transparent. Once the corresponding JSR certificate is established for the mean recursion, the stochastic reinforcement-learning setting can be treated by replacing deterministic modes with sampled stochastic modes and adding the corresponding stochastic-noise analysis.
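
As a reading aid, the objects the abstract refers to can be written in one common notation. The symbols below are assumptions made here, since the abstract fixes none of them: Φ is the feature matrix with one row per state-action pair, θ the online weights, θ⁻ the target weights, D a positive diagonal state-action weighting, R the reward vector, P the transition matrix, and Π_θ the greedy action selection with respect to Φθ. The first line is the usual orthogonality form of the projected Bellman equation, valid when Φ has full column rank.

```latex
% Projected Q-Bellman equation, with solution \theta^*
\Phi^\top D \left( R + \gamma P \Pi_{\theta^*} \Phi \theta^* - \Phi \theta^* \right) = 0

% Deterministic linear Q-learning step with a frozen target \theta^-
\theta_{k+1} = \theta_k + \alpha \, \Phi^\top D
  \left( R + \gamma P \Pi_{\theta^-} \Phi \theta^- - \Phi \theta_k \right)

% Periodic hard target update with period T
\theta^- \leftarrow \theta_k \quad \text{whenever } k \bmod T = 0
```

Because Π depends on the weights only through which action is greedy in each state, the recursion is piecewise linear, which is what makes a switched-system description possible.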

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Target updates stabilize linear Q-learning under spectral conditions via switched system analysis.

This paper shows that periodic and soft target updates can stabilize linear Q-learning under specific spectral and step-size conditions by modeling the process as a switched linear system whose joint spectral radius is less than one.

The new part is applying joint spectral radius analysis to the exact dynamics from the Bellman max for both hard periodic and soft updates. This gives an exact convergence proof to the projected fixed point in the deterministic setting, which prior work on linear Q-learning did not have. The approach is clean because it uses the finite modes induced by the max operator for finite actions, and then handles the stochastic case via the mean recursion with added noise analysis. Credit goes to the clear separation of the deterministic analysis from the stochastic extension.

The soft spots are minor. The result is conditional on the joint spectral radius being below one and on step sizes, so it does not claim to explain all cases where target updates help in practice. Checking those conditions for a specific problem might require computation, but the paper does not claim otherwise. The stochastic part is outlined rather than fully derived in the abstract, but the logic for the mean part appears sound and non-circular since it builds on external control results.

This work is for people studying convergence in approximate dynamic programming and reinforcement learning theory. Readers who want to see how control-theoretic tools can explain RL heuristics will get something from it. It is worth a serious referee because the modeling is consistent and the claim is precise, even though the full proofs would need verification for any hidden gaps.

I would recommend sending it out for peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that periodic hard target updates and soft target updates stabilize linear Q-learning with function approximation by inducing switched linear system dynamics from the Bellman maximum operator; under explicit conditions on the joint spectral radius (JSR) of the resulting matrix family and on step sizes, the system converges to the exact projected Q-Bellman fixed point. The deterministic case receives the main rigorous treatment via JSR certificates, with a sketch for the stochastic case via mean recursion plus noise analysis.

Significance. If the JSR-based certificates are valid, the work supplies the first explicit, non-asymptotic stability conditions explaining why target networks empirically stabilize Q-learning, moving beyond generic contraction arguments to control-theoretic switched-system analysis. The parameter-free nature of the JSR certificate (once the switching family is fixed) and the exact modeling of the Bellman max as finite modes for finite action spaces are particular strengths.

major comments (2)

[Abstract and §1 (stochastic extension paragraph)] The stochastic extension (mentioned in the abstract and §1) replaces deterministic modes with sampled stochastic modes and adds noise analysis after establishing the JSR certificate on the mean recursion, but no explicit error bounds, concentration inequalities, or statement of the precise sense of convergence (almost-sure, in expectation, or with high probability) are supplied; this leaves the claim that the stochastic setting “can be treated” unverified as a load-bearing step.
[§3 (switched linear system construction)] The modeling assumption that the Bellman maximum induces an exact finite family of linear modes (and thus a well-defined switching matrix family whose JSR can be computed or bounded) is stated without an explicit construction or enumeration of the modes for a concrete MDP; without this, it is impossible to check whether the claimed JSR < 1 conditions are non-vacuous or merely restate the fixed-point property.

minor comments (2)

[§4 (soft updates)] Notation for the augmented state in the soft-update case is introduced without a clear diagram or recurrence relating the target network parameters to the online parameters; a small example would clarify the dimension of the switched system.
[Introduction] The paper cites control-theoretic JSR results but does not compare the obtained stability conditions against existing contraction-mapping or Lyapunov analyses of target networks; a short related-work paragraph would help situate the contribution.
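
The recurrence requested in the minor comment on §4 might take the following shape. This is a sketch under assumptions, not the paper's construction: it uses a synchronous soft update with rate τ, writes π_k for the greedy policy with respect to Φθ⁻_k, and reuses the generic symbols Φ, D, P, R, α, and γ for features, weighting, transitions, rewards, step size, and discount.

```latex
% Online and target recursions (assumed synchronous soft update)
\theta_{k+1} = (I - \alpha \Phi^\top D \Phi)\,\theta_k
  + \alpha \gamma \, \Phi^\top D P \Pi_{\pi_k} \Phi \, \theta^-_k
  + \alpha \, \Phi^\top D R
\theta^-_{k+1} = \tau \, \theta_k + (1 - \tau)\, \theta^-_k

% Stacked form on the augmented state (\theta_k, \theta^-_k)
\begin{bmatrix} \theta_{k+1} \\ \theta^-_{k+1} \end{bmatrix}
=
\begin{bmatrix}
  I - \alpha \Phi^\top D \Phi & \alpha \gamma \, \Phi^\top D P \Pi_{\pi_k} \Phi \\
  \tau I & (1 - \tau) I
\end{bmatrix}
\begin{bmatrix} \theta_k \\ \theta^-_k \end{bmatrix}
+
\begin{bmatrix} \alpha \, \Phi^\top D R \\ 0 \end{bmatrix}
```

With d features, the augmented state lives in dimension 2d and each mode matrix is 2d × 2d; only the upper-right block changes with the greedy policy.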

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the JSR-based stability analysis and for the constructive major comments. We address each point below and indicate planned revisions.

Referee: [Abstract and §1 (stochastic extension paragraph)] The stochastic extension (mentioned in the abstract and §1) replaces deterministic modes with sampled stochastic modes and adds noise analysis after establishing the JSR certificate on the mean recursion, but no explicit error bounds, concentration inequalities, or statement of the precise sense of convergence (almost-sure, in expectation, or with high probability) are supplied; this leaves the claim that the stochastic setting “can be treated” unverified as a load-bearing step.

Authors: We agree that the stochastic extension is only sketched at a high level. The primary rigorous contribution is the deterministic switched-linear-system analysis via JSR. The stochastic paragraph indicates that once the mean recursion is controlled by the JSR certificate, standard stochastic-approximation arguments (e.g., martingale noise terms vanishing under appropriate step-size conditions) can be applied, but no explicit bounds or convergence mode are derived. In revision we will (i) qualify the abstract and §1 to state that the stochastic case is indicated as a direct extension rather than fully developed, and (ii) add a short paragraph outlining the precise sense of convergence in expectation that follows from the mean recursion plus standard noise analysis, without claiming new concentration results.

Revision: partial

Referee: [§3 (switched linear system construction)] The modeling assumption that the Bellman maximum induces an exact finite family of linear modes (and thus a well-defined switching matrix family whose JSR can be computed or bounded) is stated without an explicit construction or enumeration of the modes for a concrete MDP; without this, it is impossible to check whether the claimed JSR < 1 conditions are non-vacuous or merely restate the fixed-point property.

Authors: The finite family arises because, with finite actions and linear features, the Bellman optimality operator (after projection) selects, at each step, one of |A| possible linear maps corresponding to the greedy action; the resulting matrix family is therefore finite. We acknowledge that §3 presents this construction abstractly without a worked numerical example. In the revision we will insert a short illustrative subsection (or appendix) that takes a small finite MDP, explicitly enumerates the |A| matrices in the switching family, computes or bounds their joint spectral radius, and verifies that JSR < 1 is a non-trivial condition that is independent of the fixed-point property itself.

Revision: yes
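
An enumeration of the kind promised in this response could be sketched as follows. The sketch makes its own assumptions: it builds modes for plain deterministic linear Q-learning without a target, as A_π = I + α Φᵀ D (γ P Π_π Φ − Φ), and it indexes modes by deterministic greedy policies, so it produces up to |A|^|S| matrices. That indexing follows the usual switching-system view of Q-learning and may differ from the enumeration the paper actually uses. All names and the tiny instance are hypothetical.

```python
from itertools import product


def mode_matrices(phi, d, p, gamma, alpha, n_states, n_actions):
    """Return {policy: A_policy}; state-action pairs are indexed as s * n_actions + a."""
    n_sa, n_feat = len(phi), len(phi[0])
    modes = {}
    for policy in product(range(n_actions), repeat=n_states):
        # next_feat[i][l]: expected feature l of (s', policy[s']) after pair i
        next_feat = [
            [sum(p[i][s] * phi[s * n_actions + policy[s]][l] for s in range(n_states))
             for l in range(n_feat)]
            for i in range(n_sa)
        ]
        # Start from the identity, then add alpha * Phi^T D (gamma * next_feat - Phi).
        a = [[1.0 if j == l else 0.0 for l in range(n_feat)] for j in range(n_feat)]
        for j in range(n_feat):
            for l in range(n_feat):
                a[j][l] += alpha * sum(
                    d[i] * phi[i][j] * (gamma * next_feat[i][l] - phi[i][l])
                    for i in range(n_sa)
                )
        modes[policy] = a
    return modes


# Hypothetical instance: 2 states, 2 actions, one feature, every transition leads to state 0.
modes = mode_matrices(
    phi=[[1.0], [2.0], [1.0], [2.0]], d=[0.25] * 4, p=[[1.0, 0.0]] * 4,
    gamma=0.5, alpha=0.1, n_states=2, n_actions=2,
)
for policy, matrix in sorted(modes.items()):
    print(policy, matrix)
```

In this one-feature instance each mode is a scalar, so the joint spectral radius is simply the largest absolute value among them, which makes it easy to see whether the condition is satisfied for a given step size.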

Circularity Check

0 steps flagged

No significant circularity; derivation uses external JSR theory on standard switched-system model

The paper models target updates as inducing a finite family of linear modes from the Bellman max operator (standard for finite actions), then invokes the joint spectral radius of that family being <1 under explicit step-size/spectral conditions to certify convergence to the projected fixed point. This is a direct application of existing control-theoretic results to the deterministic mean recursion, followed by a separate stochastic-noise argument; no parameter is fitted inside the paper and then renamed as a prediction, no self-citation supplies the uniqueness or convergence certificate, and the modeling assumptions are stated explicitly rather than smuggled. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard mathematical properties of the joint spectral radius for switched linear systems and on the modeling assumption that Q-learning target updates induce a switched linear system whose stability is governed by that radius.

axioms (1)

[standard math] A joint spectral radius of a matrix family less than one implies asymptotic stability of the corresponding switched linear system.
Standard result from switched systems theory invoked to certify convergence.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geometrically Averaged Hard Target Updates for Linear Q-Learning
cs.LG · 2026-06 · unverdicted · novelty 6.0

Introduces and analyzes the λ-target update for linear Q-learning via geometric averaging of periodic target maps, studied with a switching-system model in the deterministic case.

Reference graph

Works this paper leans on

30 extracted references · 2 canonical work pages · cited by 1 Pith paper

[1] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[2] Vincent D. Blondel and Yurii Nesterov. Computationally efficient approximations of the joint spectral radius. SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005.
[3] Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
[4] Fengdi Che, Chenjun Xiao, Jincheng Mei, Bo Dai, Ramki Gummadi, Oscar A. Ramirez, Christopher K. Harris, A. Rupam Mahmood, and Dale Schuurmans. Target networks and overparameterization stabilize off-policy bootstrapping with function approximation. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine... 2024.
[5] Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in Q-learning. SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023. doi:10.1137/22M1499261
[6] Antonio Cicone. A note on the joint spectral radius. arXiv preprint arXiv:1502.01506, 2015.
[7] Mattie Fellows, Matthew J. A. Smith, and Shimon Whiteson. Why target networks stabilise temporal difference methods. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 9886–9909. PMLR, 2023.
[8] Jianghai Hu, Jinglai Shen, and Wei Zhang. Generating functions of switched linear systems: analysis, computation, and stability applications. IEEE Transactions on Automatic Control, 56(5):1059–1074, 2011.
[9] Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. Advances in Neural Information Processing Systems, 6, 1993.
[10] Christopher Heil and Gilbert Strang. Continuity of the joint spectral radius: application to wavelets. In A. Bojanczyk and G. Cybenko, editors, Linear Algebra for Signal Processing, volume 69 of The IMA Volumes in Mathematics and its Applications, pages 51–61. Springer, 1995.
[11] Raphaël Jungers. The Joint Spectral Radius: Theory and Applications. Springer, volume 385, 2009.
[12] Erwin Kreyszig. Introductory Functional Analysis with Applications. John Wiley & Sons, New York, 1978.
[13] Donghwan Lee. Lyapunov-certified direct switching theory for Q-learning. arXiv preprint arXiv:2604.19569, 2026.
[14] Donghwan Lee and Niao He. Target-based temporal-difference learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3713–3722. PMLR, 2019.
[15] Donghwan Lee and Niao He. Periodic Q-learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pages 582–598. PMLR, 2020.
[16] Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning. SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023.
[17] Donghwan Lee and Han-Dong Lim. A switching system theory of Q-learning with linear function approximation. arXiv preprint arXiv:2605.11021, 2026. https://arxiv.org/pdf/2605.11021
[18] Daniel Liberzon. Switching in Systems and Control. Springer Science & Business Media, 2003.
[19] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations, 2016. arXiv:1509.02971
[20] Han-Dong Lim and Donghwan Lee. Understanding the theoretical properties of projected Bellman equation, linear Q-learning, and approximate value iteration. arXiv preprint arXiv:2504.10865, 2025.
[21] Hai Lin and Panos J. Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results. IEEE Transactions on Automatic Control, 54(2):308–322, 2009.
[22] Sean P. Meyn. The projected Bellman equation in reinforcement learning. IEEE Transactions on Automatic Control, 69(12):8323–8337, 2024. doi:10.1109/TAC.2024.3409647
[23] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement... 2015.
[24] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, 1994.
[25] Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius. Indagationes Mathematicae, 22(4):379–381, 1960.
[26] Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulff, and Christopher King. Stability criteria for switched and hybrid systems. SIAM Review, 49(4):545–592, 2007.
[27] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[28] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
[29] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
[30] Shangtong Zhang, Hengshuai Yao, and Shimon Whiteson. Breaking the deadly triad with a target network. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12621–12631. PMLR, 2021.
