Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version

Masoud S. Sakha; Rushikesh Kamalapurkar; Sean Meyn

arxiv: 2603.27874 · v2 · submitted 2026-03-29 · 💻 cs.LG · math.OC

Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version

Masoud S. Sakha , Rushikesh Kamalapurkar , Sean Meyn This is my paper

Pith reviewed 2026-05-14 21:22 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords relative TD learningstability analysislinear function approximationasymptotic biasasymptotic covariancediscount factorsensitivity analysisreinforcement learning

0 comments

The pith

Relative TD learning with empirical baseline stays stable for any discount factor under linear function approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes relative temporal-difference learning, which modifies the standard TD update by subtracting a baseline to address slow convergence as the discount factor nears one. It focuses on linear function approximation and proves that selecting the baseline as the empirical distribution of the observed state-action process guarantees stability for any non-negative baseline weight and any discount factor. The work also delivers a sensitivity analysis showing that both the asymptotic bias and covariance of the learned parameters remain uniformly bounded as the discount approaches one. This setup matters because conventional TD methods often lose stability or slow dramatically for long-horizon problems with high discounts.

Core claim

When the baseline distribution equals the empirical distribution of the state-action process, relative TD learning with linear function approximation remains stable for every non-negative baseline weight and every discount factor in [0,1). The asymptotic covariance matrix and bias vector of the parameter estimates stay uniformly bounded as the discount factor approaches one.

What carries the argument

The relative TD update rule that subtracts a baseline term from the temporal-difference error, with the baseline set exactly to the empirical distribution of state-action pairs collected along the algorithm's own trajectory.

If this is right

The algorithm converges to a stable fixed point even when the discount factor is arbitrarily close to one.
Asymptotic bias remains controlled and does not increase with larger discount factors.
Asymptotic covariance stays bounded uniformly, independent of how close the discount is to one.
Stability holds only when the baseline matches the empirical distribution under the current policy trajectory.
The sensitivity results quantify how parameter estimates change with discount factor without explosion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Online estimation of the empirical baseline from samples could be used in practice while preserving the stability guarantee.
Similar baseline-matching ideas might stabilize other TD variants or off-policy methods for long-horizon tasks.
The bounded-bias result suggests relative TD could be preferable to standard TD in applications where discount factors near one are required.
Extensions to average-reward or continuing-task settings may follow if the same empirical-baseline construction is retained.

Load-bearing premise

The proof relies on linear function approximation together with the baseline being exactly the empirical distribution generated by the algorithm's trajectory.

What would settle it

A simulation or analysis showing divergence or unbounded growth in bias when the baseline is chosen differently from the empirical distribution, or when nonlinear function approximation is used instead.

Figures

Figures reproduced from arXiv: 2603.27874 by Masoud S. Sakha, Rushikesh Kamalapurkar, Sean Meyn.

**Figure 2.** Figure 2: Eigenvalues as a function of discount factor [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗

**Figure 3.** Figure 3: Histograms of the scaled and centered parameter estimates for TD(0) and [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Bias realized vs theory [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Squared-norm of normalized bias as a function of [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Eigenvalues as a function of discount factor [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Histograms of the scaled and centered parameter estimates for TD(0) and [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

read the original abstract

Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods when the discount factor approaches one by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both asymptotic bias and covariance. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Relative TD with linear FA is stable for empirical baselines at any gamma, with uniform bias/covariance bounds as gamma to 1, but the online coupled estimation of the baseline may not be fully handled.

read the letter

The paper's core contribution is showing stability of relative TD under linear function approximation when the baseline is the empirical distribution of the state-action process. This holds for any non-negative baseline weight and any discount factor, plus uniform boundedness of the asymptotic bias and covariance as gamma approaches 1. That is new relative to the tabular results it cites and directly targets the slow convergence issue in long-horizon RL. The analysis uses standard stochastic approximation arguments to get the mean-field ODE and the sensitivity results, which is appropriate for this setting. The claims on the role of the baseline choice are clearly stated and seem to follow from the derivations. One soft spot is the treatment of the baseline itself. The abstract and claims treat it as the exact empirical distribution, but the online algorithm must estimate it incrementally from the same trajectory. This adds martingale noise and slower mixing effects that grow with gamma, and it is not obvious whether those terms are propagated into the covariance bound. If the Lyapunov or covariance analysis freezes the baseline at its limit, the uniform bound could be fragile in the fully coupled system. The paper does not appear to include code or machine-checked proofs, so the details rest on the written derivations. This work is aimed at RL theorists who care about convergence rates and bias control under function approximation. A reader looking for concrete stability conditions on modified TD updates would find the baseline analysis useful. It deserves a serious referee because the central claims address a documented gap and are stated sharply enough to be checked.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes relative temporal-difference learning with linear function approximation. It establishes stability conditions showing that the algorithm is stable for any non-negative baseline weight and any discount factor when the baseline is chosen as the empirical distribution of the state-action process. A sensitivity analysis is provided that characterizes the asymptotic bias and covariance of the parameter estimates, with both quantities shown to remain uniformly bounded as the discount factor approaches one.

Significance. If the central claims hold, the results offer useful theoretical support for relative TD methods in the linear function approximation setting, addressing slow convergence issues near γ=1. The uniform boundedness of asymptotic quantities could inform algorithm design for stable learning in long-horizon problems.

major comments (2)

[§4] §4 (mean-field ODE and Lyapunov analysis): the stability argument treats the baseline distribution as fixed at its limiting empirical value when constructing the drift and Lyapunov function. This omits the additional martingale noise arising from incremental online estimation of the baseline from the same trajectory, which becomes more pronounced as γ → 1 due to slower mixing.
[§5.2] §5.2 (asymptotic covariance derivation): the closed-form expression for the asymptotic covariance does not incorporate the extra variance term from the coupled baseline estimator. Consequently, the claimed uniform boundedness as γ → 1 may not survive the online coupling, as the variance of the baseline estimator grows with the mixing time.

minor comments (1)

[§2] The distinction between the fixed limiting baseline distribution and its online estimator could be stated more explicitly in the problem formulation to avoid ambiguity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below.

read point-by-point responses

Referee: [§4] §4 (mean-field ODE and Lyapunov analysis): the stability argument treats the baseline distribution as fixed at its limiting empirical value when constructing the drift and Lyapunov function. This omits the additional martingale noise arising from incremental online estimation of the baseline from the same trajectory, which becomes more pronounced as γ → 1 due to slower mixing.

Authors: We agree that the mean-field ODE and Lyapunov analysis in Section 4 is performed with the baseline fixed at its limiting empirical distribution. This choice simplifies the drift term and enables the direct application of the Lyapunov function, but it does omit the martingale noise that arises when the baseline is estimated incrementally from the same trajectory. The referee is correct that this noise becomes more significant as γ approaches 1. We will revise Section 4 to include an extended analysis that couples the baseline estimator, augment the Lyapunov function with an additional term that absorbs the baseline noise, and show that global asymptotic stability is retained for any non-negative baseline weight and any discount factor. revision: yes
Referee: [§5.2] §5.2 (asymptotic covariance derivation): the closed-form expression for the asymptotic covariance does not incorporate the extra variance term from the coupled baseline estimator. Consequently, the claimed uniform boundedness as γ → 1 may not survive the online coupling, as the variance of the baseline estimator grows with the mixing time.

Authors: The closed-form covariance expression in Section 5.2 is derived under the fixed-baseline assumption. We acknowledge that the online coupling introduces an additional variance component whose magnitude scales with the mixing time. We will revise the sensitivity analysis to derive the joint asymptotic covariance of the coupled (parameter, baseline) system and verify that the uniform boundedness result as γ → 1 continues to hold; the extra terms remain controlled because the baseline is the empirical state-action distribution whose variance is bounded independently of γ. revision: yes

Circularity Check

0 steps flagged

Minor self-citation present but central stability derivation remains independent of inputs

full rationale

The paper derives stability of relative TD learning under linear function approximation via standard stochastic approximation arguments applied to the mean-field ODE, with the baseline chosen as the empirical state-action distribution serving as an explicit modeling assumption rather than a fitted or self-referential quantity. Asymptotic bias and covariance expressions are obtained from the linearized ODE and Lyapunov analysis without reducing to the target stability claim by construction. One minor self-citation appears for background on relative TD but is not load-bearing for the new stability or uniform-boundedness results as gamma approaches 1. No equations equate a prediction directly to a fitted parameter or prior self-result in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard MDP and stochastic approximation assumptions; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption The environment is a Markov decision process and the value function is linearly approximated.
Standard assumption for TD analysis with function approximation.

pith-pipeline@v0.9.0 · 5435 in / 1125 out tokens · 38713 ms · 2026-05-14T21:22:31.255985+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Ā(λ;δr) = Ā(λ;0) − δr/(1−λγ) ψ̄ψ̄ᵀ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1]

Abounadi, D

J. Abounadi, D. Bertsekas, and V. S. Borkar. Learning algorithms for Markov decision pro- cesses with average cost.SIAM Journal on Control and Optimization, 40(3):681–698, 2001

work page 2001
[2]

Moulines and F

E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems 24, pages 451– 459, 2011

work page 2011
[3]

V. S. Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, Delhi, India, 2nd edition, 2021

work page 2021
[4]

Borkar, S

V. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning.Annals of Applied Probability (preprint at arXiv e-prints:2110.14427), 35(2):936–982, April 2025

work page arXiv 2025
[5]

A. M. Devraj and S. P. Meyn. Zap Q-learning. InProc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017

work page 2017
[6]

A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance.IEEE Trans. on Automatic Control, 67(11):5948–5963, 2022

work page 2022
[7]

Durmus, E

A. Durmus, E. Moulines, A. Naumov, and S. Samsonov. Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Op- erations Research, 2024

work page 2024
[8]

J. A. Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process.The Annals of Applied Probability, 1(1):62–87, 1991

work page 1991
[9]

C. K. Lauand and S. Meyn. Bias in stochastic approximation cannot be eliminated with averaging. InAllerton Conference on Communication, Control, and Computing, pages 1–4, Sep. 2022

work page 2022
[10]

D. A. Levin, Y. Peres, and E. L. Wilmer.Markov Chains and Mixing Times. American Mathematical Society, 2 edition, 2017

work page 2017
[11]

Mehta and S

P. Mehta and S. Meyn. Optimistic training and convergence of Q-learning – extended version. arXiv 2602.06146, 2026

work page arXiv 2026
[12]

Meyn.Control Systems and Reinforcement Learning

S. Meyn.Control Systems and Reinforcement Learning. Cambridge University Press, Cam- bridge, 2022

work page 2022
[13]

S. Meyn. The projected Bellman equation in reinforcement learning.IEEE Transactions on Automatic Control, 69(12):8323–8337, Dec 2024

work page 2024
[14]

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.SIAM J. Control Optim., 30(4):838–855, 1992

work page 1992
[15]

D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. 17

work page 1988
[16]

R. Srikant. Rates of convergence in the central limit theorem for markov chains, with an application to td learning.Math. Oper. Res., 2025

work page 2025
[17]

Srikant and L

R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and td learning. InProc. Mach. Learn. Res., volume 99, pages 2803–2830, 2019

work page 2019
[18]

Sutton and A

R. Sutton and A. Barto.Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018

work page 2018
[19]

Szepesv´ ari.Algorithms for Reinforcement Learning

C. Szepesv´ ari.Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intel- ligence and Machine Learning. Morgan & Claypool Publishers, 2010

work page 2010
[20]

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation.IEEE Trans. Automat. Control, 42(5):674–690, 1997

work page 1997
[21]

C. J. C. H. Watkins.Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989

work page 1989
[22]

reversibilization

C. J. C. H. Watkins and P. Dayan.Q-learning.Machine Learning, 8(3-4):279–292, 1992. 18 A Appendices The mean flow ¯A:= ¯A(λ; 0) for the standard on-policy TD(λ) learning algorithm is linear, with ¯A=− ∞X k=0 (λγ)kRk +γ ∞X k=0 (λγ)kR(k+ 1) =−R(0) + (1−λ)γ ∞X k=0 (λγ)kR(k+ 1) (31) It was first established in [20] that ¯Ais Hurwitz whenR(0) is full rank. The...

work page 1992

[1] [1]

Abounadi, D

J. Abounadi, D. Bertsekas, and V. S. Borkar. Learning algorithms for Markov decision pro- cesses with average cost.SIAM Journal on Control and Optimization, 40(3):681–698, 2001

work page 2001

[2] [2]

Moulines and F

E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems 24, pages 451– 459, 2011

work page 2011

[3] [3]

V. S. Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, Delhi, India, 2nd edition, 2021

work page 2021

[4] [4]

Borkar, S

V. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning.Annals of Applied Probability (preprint at arXiv e-prints:2110.14427), 35(2):936–982, April 2025

work page arXiv 2025

[5] [5]

A. M. Devraj and S. P. Meyn. Zap Q-learning. InProc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017

work page 2017

[6] [6]

A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance.IEEE Trans. on Automatic Control, 67(11):5948–5963, 2022

work page 2022

[7] [7]

Durmus, E

A. Durmus, E. Moulines, A. Naumov, and S. Samsonov. Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Op- erations Research, 2024

work page 2024

[8] [8]

J. A. Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process.The Annals of Applied Probability, 1(1):62–87, 1991

work page 1991

[9] [9]

C. K. Lauand and S. Meyn. Bias in stochastic approximation cannot be eliminated with averaging. InAllerton Conference on Communication, Control, and Computing, pages 1–4, Sep. 2022

work page 2022

[10] [10]

D. A. Levin, Y. Peres, and E. L. Wilmer.Markov Chains and Mixing Times. American Mathematical Society, 2 edition, 2017

work page 2017

[11] [11]

Mehta and S

P. Mehta and S. Meyn. Optimistic training and convergence of Q-learning – extended version. arXiv 2602.06146, 2026

work page arXiv 2026

[12] [12]

Meyn.Control Systems and Reinforcement Learning

S. Meyn.Control Systems and Reinforcement Learning. Cambridge University Press, Cam- bridge, 2022

work page 2022

[13] [13]

S. Meyn. The projected Bellman equation in reinforcement learning.IEEE Transactions on Automatic Control, 69(12):8323–8337, Dec 2024

work page 2024

[14] [14]

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.SIAM J. Control Optim., 30(4):838–855, 1992

work page 1992

[15] [15]

D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. 17

work page 1988

[16] [16]

R. Srikant. Rates of convergence in the central limit theorem for markov chains, with an application to td learning.Math. Oper. Res., 2025

work page 2025

[17] [17]

Srikant and L

R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and td learning. InProc. Mach. Learn. Res., volume 99, pages 2803–2830, 2019

work page 2019

[18] [18]

Sutton and A

R. Sutton and A. Barto.Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018

work page 2018

[19] [19]

Szepesv´ ari.Algorithms for Reinforcement Learning

C. Szepesv´ ari.Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intel- ligence and Machine Learning. Morgan & Claypool Publishers, 2010

work page 2010

[20] [20]

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation.IEEE Trans. Automat. Control, 42(5):674–690, 1997

work page 1997

[21] [21]

C. J. C. H. Watkins.Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989

work page 1989

[22] [22]

reversibilization

C. J. C. H. Watkins and P. Dayan.Q-learning.Machine Learning, 8(3-4):279–292, 1992. 18 A Appendices The mean flow ¯A:= ¯A(λ; 0) for the standard on-policy TD(λ) learning algorithm is linear, with ¯A=− ∞X k=0 (λγ)kRk +γ ∞X k=0 (λγ)kR(k+ 1) =−R(0) + (1−λ)γ ∞X k=0 (λγ)kR(k+ 1) (31) It was first established in [20] that ¯Ais Hurwitz whenR(0) is full rank. The...

work page 1992