Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version
Pith reviewed 2026-05-14 21:22 UTC · model grok-4.3
The pith
Relative TD learning with empirical baseline stays stable for any discount factor under linear function approximation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the baseline distribution equals the empirical distribution of the state-action process, relative TD learning with linear function approximation remains stable for every non-negative baseline weight and every discount factor in [0,1). The asymptotic covariance matrix and bias vector of the parameter estimates stay uniformly bounded as the discount factor approaches one.
What carries the argument
The relative TD update rule that subtracts a baseline term from the temporal-difference error, with the baseline set exactly to the empirical distribution of state-action pairs collected along the algorithm's own trajectory.
If this is right
- The algorithm converges to a stable fixed point even when the discount factor is arbitrarily close to one.
- Asymptotic bias remains controlled and does not increase with larger discount factors.
- Asymptotic covariance stays bounded uniformly, independent of how close the discount is to one.
- Stability holds only when the baseline matches the empirical distribution under the current policy trajectory.
- The sensitivity results quantify how parameter estimates change with discount factor without explosion.
Where Pith is reading between the lines
- Online estimation of the empirical baseline from samples could be used in practice while preserving the stability guarantee.
- Similar baseline-matching ideas might stabilize other TD variants or off-policy methods for long-horizon tasks.
- The bounded-bias result suggests relative TD could be preferable to standard TD in applications where discount factors near one are required.
- Extensions to average-reward or continuing-task settings may follow if the same empirical-baseline construction is retained.
Load-bearing premise
The proof relies on linear function approximation together with the baseline being exactly the empirical distribution generated by the algorithm's trajectory.
What would settle it
A simulation or analysis showing divergence or unbounded growth in bias when the baseline is chosen differently from the empirical distribution, or when nonlinear function approximation is used instead.
Figures
read the original abstract
Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods when the discount factor approaches one by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both asymptotic bias and covariance. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes relative temporal-difference learning with linear function approximation. It establishes stability conditions showing that the algorithm is stable for any non-negative baseline weight and any discount factor when the baseline is chosen as the empirical distribution of the state-action process. A sensitivity analysis is provided that characterizes the asymptotic bias and covariance of the parameter estimates, with both quantities shown to remain uniformly bounded as the discount factor approaches one.
Significance. If the central claims hold, the results offer useful theoretical support for relative TD methods in the linear function approximation setting, addressing slow convergence issues near γ=1. The uniform boundedness of asymptotic quantities could inform algorithm design for stable learning in long-horizon problems.
major comments (2)
- [§4] §4 (mean-field ODE and Lyapunov analysis): the stability argument treats the baseline distribution as fixed at its limiting empirical value when constructing the drift and Lyapunov function. This omits the additional martingale noise arising from incremental online estimation of the baseline from the same trajectory, which becomes more pronounced as γ → 1 due to slower mixing.
- [§5.2] §5.2 (asymptotic covariance derivation): the closed-form expression for the asymptotic covariance does not incorporate the extra variance term from the coupled baseline estimator. Consequently, the claimed uniform boundedness as γ → 1 may not survive the online coupling, as the variance of the baseline estimator grows with the mixing time.
minor comments (1)
- [§2] The distinction between the fixed limiting baseline distribution and its online estimator could be stated more explicitly in the problem formulation to avoid ambiguity for readers.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below.
read point-by-point responses
-
Referee: [§4] §4 (mean-field ODE and Lyapunov analysis): the stability argument treats the baseline distribution as fixed at its limiting empirical value when constructing the drift and Lyapunov function. This omits the additional martingale noise arising from incremental online estimation of the baseline from the same trajectory, which becomes more pronounced as γ → 1 due to slower mixing.
Authors: We agree that the mean-field ODE and Lyapunov analysis in Section 4 is performed with the baseline fixed at its limiting empirical distribution. This choice simplifies the drift term and enables the direct application of the Lyapunov function, but it does omit the martingale noise that arises when the baseline is estimated incrementally from the same trajectory. The referee is correct that this noise becomes more significant as γ approaches 1. We will revise Section 4 to include an extended analysis that couples the baseline estimator, augment the Lyapunov function with an additional term that absorbs the baseline noise, and show that global asymptotic stability is retained for any non-negative baseline weight and any discount factor. revision: yes
-
Referee: [§5.2] §5.2 (asymptotic covariance derivation): the closed-form expression for the asymptotic covariance does not incorporate the extra variance term from the coupled baseline estimator. Consequently, the claimed uniform boundedness as γ → 1 may not survive the online coupling, as the variance of the baseline estimator grows with the mixing time.
Authors: The closed-form covariance expression in Section 5.2 is derived under the fixed-baseline assumption. We acknowledge that the online coupling introduces an additional variance component whose magnitude scales with the mixing time. We will revise the sensitivity analysis to derive the joint asymptotic covariance of the coupled (parameter, baseline) system and verify that the uniform boundedness result as γ → 1 continues to hold; the extra terms remain controlled because the baseline is the empirical state-action distribution whose variance is bounded independently of γ. revision: yes
Circularity Check
Minor self-citation present but central stability derivation remains independent of inputs
full rationale
The paper derives stability of relative TD learning under linear function approximation via standard stochastic approximation arguments applied to the mean-field ODE, with the baseline chosen as the empirical state-action distribution serving as an explicit modeling assumption rather than a fitted or self-referential quantity. Asymptotic bias and covariance expressions are obtained from the linearized ODE and Lyapunov analysis without reducing to the target stability claim by construction. One minor self-citation appears for background on relative TD but is not load-bearing for the new stability or uniform-boundedness results as gamma approaches 1. No equations equate a prediction directly to a fitted parameter or prior self-result in a circular manner.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The environment is a Markov decision process and the value function is linearly approximated.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Ā(λ;δr) = Ā(λ;0) − δr/(1−λγ) ψ̄ψ̄ᵀ
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
J. Abounadi, D. Bertsekas, and V. S. Borkar. Learning algorithms for Markov decision pro- cesses with average cost.SIAM Journal on Control and Optimization, 40(3):681–698, 2001
work page 2001
-
[2]
E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems 24, pages 451– 459, 2011
work page 2011
-
[3]
V. S. Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, Delhi, India, 2nd edition, 2021
work page 2021
- [4]
-
[5]
A. M. Devraj and S. P. Meyn. Zap Q-learning. InProc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017
work page 2017
-
[6]
A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance.IEEE Trans. on Automatic Control, 67(11):5948–5963, 2022
work page 2022
- [7]
-
[8]
J. A. Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process.The Annals of Applied Probability, 1(1):62–87, 1991
work page 1991
-
[9]
C. K. Lauand and S. Meyn. Bias in stochastic approximation cannot be eliminated with averaging. InAllerton Conference on Communication, Control, and Computing, pages 1–4, Sep. 2022
work page 2022
-
[10]
D. A. Levin, Y. Peres, and E. L. Wilmer.Markov Chains and Mixing Times. American Mathematical Society, 2 edition, 2017
work page 2017
-
[11]
P. Mehta and S. Meyn. Optimistic training and convergence of Q-learning – extended version. arXiv 2602.06146, 2026
-
[12]
Meyn.Control Systems and Reinforcement Learning
S. Meyn.Control Systems and Reinforcement Learning. Cambridge University Press, Cam- bridge, 2022
work page 2022
-
[13]
S. Meyn. The projected Bellman equation in reinforcement learning.IEEE Transactions on Automatic Control, 69(12):8323–8337, Dec 2024
work page 2024
-
[14]
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.SIAM J. Control Optim., 30(4):838–855, 1992
work page 1992
-
[15]
D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. 17
work page 1988
-
[16]
R. Srikant. Rates of convergence in the central limit theorem for markov chains, with an application to td learning.Math. Oper. Res., 2025
work page 2025
-
[17]
R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and td learning. InProc. Mach. Learn. Res., volume 99, pages 2803–2830, 2019
work page 2019
-
[18]
R. Sutton and A. Barto.Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018
work page 2018
-
[19]
Szepesv´ ari.Algorithms for Reinforcement Learning
C. Szepesv´ ari.Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intel- ligence and Machine Learning. Morgan & Claypool Publishers, 2010
work page 2010
-
[20]
J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation.IEEE Trans. Automat. Control, 42(5):674–690, 1997
work page 1997
-
[21]
C. J. C. H. Watkins.Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989
work page 1989
-
[22]
C. J. C. H. Watkins and P. Dayan.Q-learning.Machine Learning, 8(3-4):279–292, 1992. 18 A Appendices The mean flow ¯A:= ¯A(λ; 0) for the standard on-policy TD(λ) learning algorithm is linear, with ¯A=− ∞X k=0 (λγ)kRk +γ ∞X k=0 (λγ)kR(k+ 1) =−R(0) + (1−λ)γ ∞X k=0 (λγ)kR(k+ 1) (31) It was first established in [20] that ¯Ais Hurwitz whenR(0) is full rank. The...
work page 1992
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.