pith. sign in

arxiv: 2603.27874 · v2 · submitted 2026-03-29 · 💻 cs.LG · math.OC

Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version

Pith reviewed 2026-05-14 21:22 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords relative TD learningstability analysislinear function approximationasymptotic biasasymptotic covariancediscount factorsensitivity analysisreinforcement learning
0
0 comments X

The pith

Relative TD learning with empirical baseline stays stable for any discount factor under linear function approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes relative temporal-difference learning, which modifies the standard TD update by subtracting a baseline to address slow convergence as the discount factor nears one. It focuses on linear function approximation and proves that selecting the baseline as the empirical distribution of the observed state-action process guarantees stability for any non-negative baseline weight and any discount factor. The work also delivers a sensitivity analysis showing that both the asymptotic bias and covariance of the learned parameters remain uniformly bounded as the discount approaches one. This setup matters because conventional TD methods often lose stability or slow dramatically for long-horizon problems with high discounts.

Core claim

When the baseline distribution equals the empirical distribution of the state-action process, relative TD learning with linear function approximation remains stable for every non-negative baseline weight and every discount factor in [0,1). The asymptotic covariance matrix and bias vector of the parameter estimates stay uniformly bounded as the discount factor approaches one.

What carries the argument

The relative TD update rule that subtracts a baseline term from the temporal-difference error, with the baseline set exactly to the empirical distribution of state-action pairs collected along the algorithm's own trajectory.

If this is right

  • The algorithm converges to a stable fixed point even when the discount factor is arbitrarily close to one.
  • Asymptotic bias remains controlled and does not increase with larger discount factors.
  • Asymptotic covariance stays bounded uniformly, independent of how close the discount is to one.
  • Stability holds only when the baseline matches the empirical distribution under the current policy trajectory.
  • The sensitivity results quantify how parameter estimates change with discount factor without explosion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Online estimation of the empirical baseline from samples could be used in practice while preserving the stability guarantee.
  • Similar baseline-matching ideas might stabilize other TD variants or off-policy methods for long-horizon tasks.
  • The bounded-bias result suggests relative TD could be preferable to standard TD in applications where discount factors near one are required.
  • Extensions to average-reward or continuing-task settings may follow if the same empirical-baseline construction is retained.

Load-bearing premise

The proof relies on linear function approximation together with the baseline being exactly the empirical distribution generated by the algorithm's trajectory.

What would settle it

A simulation or analysis showing divergence or unbounded growth in bias when the baseline is chosen differently from the empirical distribution, or when nonlinear function approximation is used instead.

Figures

Figures reproduced from arXiv: 2603.27874 by Masoud S. Sakha, Rushikesh Kamalapurkar, Sean Meyn.

Figure 1
Figure 1. Figure 1: 3-state 2-action MDP model, with randomized policy. Variants of TD learning were [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Eigenvalues as a function of discount factor [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Histograms of the scaled and centered parameter estimates for TD(0) and [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Bias realized vs theory [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Squared-norm of normalized bias as a function of [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Eigenvalues as a function of discount factor [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Histograms of the scaled and centered parameter estimates for TD(0) and [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
read the original abstract

Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods when the discount factor approaches one by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both asymptotic bias and covariance. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes relative temporal-difference learning with linear function approximation. It establishes stability conditions showing that the algorithm is stable for any non-negative baseline weight and any discount factor when the baseline is chosen as the empirical distribution of the state-action process. A sensitivity analysis is provided that characterizes the asymptotic bias and covariance of the parameter estimates, with both quantities shown to remain uniformly bounded as the discount factor approaches one.

Significance. If the central claims hold, the results offer useful theoretical support for relative TD methods in the linear function approximation setting, addressing slow convergence issues near γ=1. The uniform boundedness of asymptotic quantities could inform algorithm design for stable learning in long-horizon problems.

major comments (2)
  1. [§4] §4 (mean-field ODE and Lyapunov analysis): the stability argument treats the baseline distribution as fixed at its limiting empirical value when constructing the drift and Lyapunov function. This omits the additional martingale noise arising from incremental online estimation of the baseline from the same trajectory, which becomes more pronounced as γ → 1 due to slower mixing.
  2. [§5.2] §5.2 (asymptotic covariance derivation): the closed-form expression for the asymptotic covariance does not incorporate the extra variance term from the coupled baseline estimator. Consequently, the claimed uniform boundedness as γ → 1 may not survive the online coupling, as the variance of the baseline estimator grows with the mixing time.
minor comments (1)
  1. [§2] The distinction between the fixed limiting baseline distribution and its online estimator could be stated more explicitly in the problem formulation to avoid ambiguity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below.

read point-by-point responses
  1. Referee: [§4] §4 (mean-field ODE and Lyapunov analysis): the stability argument treats the baseline distribution as fixed at its limiting empirical value when constructing the drift and Lyapunov function. This omits the additional martingale noise arising from incremental online estimation of the baseline from the same trajectory, which becomes more pronounced as γ → 1 due to slower mixing.

    Authors: We agree that the mean-field ODE and Lyapunov analysis in Section 4 is performed with the baseline fixed at its limiting empirical distribution. This choice simplifies the drift term and enables the direct application of the Lyapunov function, but it does omit the martingale noise that arises when the baseline is estimated incrementally from the same trajectory. The referee is correct that this noise becomes more significant as γ approaches 1. We will revise Section 4 to include an extended analysis that couples the baseline estimator, augment the Lyapunov function with an additional term that absorbs the baseline noise, and show that global asymptotic stability is retained for any non-negative baseline weight and any discount factor. revision: yes

  2. Referee: [§5.2] §5.2 (asymptotic covariance derivation): the closed-form expression for the asymptotic covariance does not incorporate the extra variance term from the coupled baseline estimator. Consequently, the claimed uniform boundedness as γ → 1 may not survive the online coupling, as the variance of the baseline estimator grows with the mixing time.

    Authors: The closed-form covariance expression in Section 5.2 is derived under the fixed-baseline assumption. We acknowledge that the online coupling introduces an additional variance component whose magnitude scales with the mixing time. We will revise the sensitivity analysis to derive the joint asymptotic covariance of the coupled (parameter, baseline) system and verify that the uniform boundedness result as γ → 1 continues to hold; the extra terms remain controlled because the baseline is the empirical state-action distribution whose variance is bounded independently of γ. revision: yes

Circularity Check

0 steps flagged

Minor self-citation present but central stability derivation remains independent of inputs

full rationale

The paper derives stability of relative TD learning under linear function approximation via standard stochastic approximation arguments applied to the mean-field ODE, with the baseline chosen as the empirical state-action distribution serving as an explicit modeling assumption rather than a fitted or self-referential quantity. Asymptotic bias and covariance expressions are obtained from the linearized ODE and Lyapunov analysis without reducing to the target stability claim by construction. One minor self-citation appears for background on relative TD but is not load-bearing for the new stability or uniform-boundedness results as gamma approaches 1. No equations equate a prediction directly to a fitted parameter or prior self-result in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard MDP and stochastic approximation assumptions; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The environment is a Markov decision process and the value function is linearly approximated.
    Standard assumption for TD analysis with function approximation.

pith-pipeline@v0.9.0 · 5435 in / 1125 out tokens · 38713 ms · 2026-05-14T21:22:31.255985+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    Abounadi, D

    J. Abounadi, D. Bertsekas, and V. S. Borkar. Learning algorithms for Markov decision pro- cesses with average cost.SIAM Journal on Control and Optimization, 40(3):681–698, 2001

  2. [2]

    Moulines and F

    E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems 24, pages 451– 459, 2011

  3. [3]

    V. S. Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, Delhi, India, 2nd edition, 2021

  4. [4]

    Borkar, S

    V. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning.Annals of Applied Probability (preprint at arXiv e-prints:2110.14427), 35(2):936–982, April 2025

  5. [5]

    A. M. Devraj and S. P. Meyn. Zap Q-learning. InProc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017

  6. [6]

    A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance.IEEE Trans. on Automatic Control, 67(11):5948–5963, 2022

  7. [7]

    Durmus, E

    A. Durmus, E. Moulines, A. Naumov, and S. Samsonov. Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Op- erations Research, 2024

  8. [8]

    J. A. Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process.The Annals of Applied Probability, 1(1):62–87, 1991

  9. [9]

    C. K. Lauand and S. Meyn. Bias in stochastic approximation cannot be eliminated with averaging. InAllerton Conference on Communication, Control, and Computing, pages 1–4, Sep. 2022

  10. [10]

    D. A. Levin, Y. Peres, and E. L. Wilmer.Markov Chains and Mixing Times. American Mathematical Society, 2 edition, 2017

  11. [11]

    Mehta and S

    P. Mehta and S. Meyn. Optimistic training and convergence of Q-learning – extended version. arXiv 2602.06146, 2026

  12. [12]

    Meyn.Control Systems and Reinforcement Learning

    S. Meyn.Control Systems and Reinforcement Learning. Cambridge University Press, Cam- bridge, 2022

  13. [13]

    S. Meyn. The projected Bellman equation in reinforcement learning.IEEE Transactions on Automatic Control, 69(12):8323–8337, Dec 2024

  14. [14]

    B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.SIAM J. Control Optim., 30(4):838–855, 1992

  15. [15]

    D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. 17

  16. [16]

    R. Srikant. Rates of convergence in the central limit theorem for markov chains, with an application to td learning.Math. Oper. Res., 2025

  17. [17]

    Srikant and L

    R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and td learning. InProc. Mach. Learn. Res., volume 99, pages 2803–2830, 2019

  18. [18]

    Sutton and A

    R. Sutton and A. Barto.Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018

  19. [19]

    Szepesv´ ari.Algorithms for Reinforcement Learning

    C. Szepesv´ ari.Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intel- ligence and Machine Learning. Morgan & Claypool Publishers, 2010

  20. [20]

    J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation.IEEE Trans. Automat. Control, 42(5):674–690, 1997

  21. [21]

    C. J. C. H. Watkins.Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989

  22. [22]

    reversibilization

    C. J. C. H. Watkins and P. Dayan.Q-learning.Machine Learning, 8(3-4):279–292, 1992. 18 A Appendices The mean flow ¯A:= ¯A(λ; 0) for the standard on-policy TD(λ) learning algorithm is linear, with ¯A=− ∞X k=0 (λγ)kRk +γ ∞X k=0 (λγ)kR(k+ 1) =−R(0) + (1−λ)γ ∞X k=0 (λγ)kR(k+ 1) (31) It was first established in [20] that ¯Ais Hurwitz whenR(0) is full rank. The...