Regularized Centered Emphatic Temporal Difference Learning

Chaohui Wu; Chao Li; Guang Yang; Jinguo Ye; Shangdong Yang; Tianyu Liang; Wenhao Wang; Xingguo Chen

arxiv: 2605.04100 · v1 · submitted 2026-05-02 · 💻 cs.AI

Xingguo Chen, Chaohui Wu, Jinguo Ye, Chao Li, Shangdong Yang, Guang Yang, Tianyu Liang, Wenhao Wang

Pith reviewed 2026-05-09 14:27 UTC · model grok-4.3

classification 💻 cs.AI

keywords off-policy learning · temporal difference learning · emphatic TD · regularization · Bellman error centering · reinforcement learning · function approximation · stability

The pith

Regularized emphatic TD learning stabilizes off-policy updates by regularizing only the auxiliary centering recursion while preserving the follow-on trace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Off-policy temporal-difference learning with function approximation must balance stability against projection geometry and variance. Emphatic TD improves geometry through follow-on emphasis but suffers high variance in its trace. Introducing Bellman-error centering removes a common drift term yet creates an auxiliary coupling that can destroy positive-definiteness of the emphatic key matrix. The paper shows that regularizing solely the centering recursion by lifting the lower-right block of the coupled matrix from 1 to 1+c restores convergence without discarding the emphatic geometry. Diagnostic linear off-policy prediction experiments confirm that the resulting method avoids instability of naive centering and maintains a robust regime for the regularization parameter.

Core claim

The core claim is that Regularized Emphatic Temporal-Difference Learning (RETD) keeps the follow-on trace intact and regularizes only the auxiliary centering recursion, which amounts to lifting the lower-right block of the coupled key matrix from 1 to 1+c. The resulting RETD core matrix remains positive definite under a conservative sufficient condition on c, the iterates converge under that condition, and the method retains favorable emphatic geometry on linear off-policy prediction tasks.

What carries the argument

The argument rests on the RETD core matrix, formed by lifting the lower-right block of the coupled key matrix from 1 to 1+c. The lift regularizes the auxiliary centering recursion while leaving the follow-on trace untouched.
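To make the block-lifting operation concrete, here is a schematic sketch rather than the paper's own equation. The block names A, b, and d are placeholders assumed for illustration; only the lower-right entry and its lift from 1 to 1+c come from the text above.

```latex
G_{0} =
\begin{pmatrix} A & b \\ d^{\top} & 1 \end{pmatrix}
\quad\longrightarrow\quad
G_{c} =
\begin{pmatrix} A & b \\ d^{\top} & 1 + c \end{pmatrix},
\qquad c > 0 .

% Generic Schur-complement fact, assuming A_s = (A + A^T)/2 is positive definite:
x^{\top} G_{c}\, x > 0 \ \ \forall x \neq 0
\quad\Longleftrightarrow\quad
1 + c > \tfrac{1}{4}\,(b + d)^{\top} A_{s}^{-1} (b + d).
```

The second line is a standard linear-algebra statement about matrices of this shape. It is not claimed to be the paper's sufficient condition, which the source describes only as conservative.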

If this is right

RETD converges under the conservative sufficient regularization condition on c.
RETD avoids the instability observed in naive centered emphatic learning.
RETD preserves the favorable projection geometry of emphatic methods on linear off-policy prediction tasks.
An intermediate range of c yields robust performance across the diagnostic tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The selective regularization pattern may extend to other TD variants that combine emphasis with auxiliary recursions.
Practical selection of c could be guided by monitoring matrix eigenvalues during learning rather than relying solely on the conservative bound.
The approach suggests a general template for stabilizing coupled linear recursions by regularizing only the destabilizing block.
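The eigenvalue-monitoring idea above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's procedure: the helper names `lift_lower_right` and `min_sym_eig` and the toy matrix `G0` are hypothetical, and in practice the matrix would have to be estimated from data.

```python
import numpy as np

def lift_lower_right(G, c):
    """Return a copy of G with c added to its bottom-right entry."""
    G_c = np.array(G, dtype=float, copy=True)
    G_c[-1, -1] += c
    return G_c

def min_sym_eig(G):
    """Smallest eigenvalue of the symmetric part of G.

    x^T G x > 0 for every nonzero x exactly when this value is positive.
    """
    G = np.asarray(G, dtype=float)
    return float(np.linalg.eigvalsh(0.5 * (G + G.T)).min())

# Hypothetical toy matrix with a 1 in the lower-right corner (NOT from the paper).
G0 = np.array([[1.0, 3.0],
               [0.0, 1.0]])

for c in (0.0, 1.0, 2.0):
    print(f"c={c:.1f}  min eigenvalue of symmetric part = "
          f"{min_sym_eig(lift_lower_right(G0, c)):+.3f}")
```

For this toy matrix the sign flips between c = 1 and c = 2, which is the kind of transition a monitoring rule would look for.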

Load-bearing premise

A conservative sufficient condition on the regularization parameter c is strong enough to guarantee positive-definiteness and convergence without erasing the geometric benefits of emphasis.

What would settle it

Run the diagnostic linear off-policy tasks with values of c below the stated sufficient condition and check whether the key matrix loses positive definiteness or the iterates diverge.
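A minimal sketch of such a scan, on a toy problem rather than the paper's diagnostic tasks, is shown below. The matrix `G0_toy`, the function names, and the step count are all assumptions made for illustration. The deterministic iteration x <- x - alpha * G x stands in for the expected update of a linear stochastic recursion.

```python
import numpy as np

def expected_iteration_norm(G, alpha=0.01, steps=5000, seed=0):
    """Run x <- x - alpha * G x from a random start and return the final norm."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(G.shape[0])
    for _ in range(steps):
        x = x - alpha * (G @ x)
    return float(np.linalg.norm(x))

def scan_c(G0, c_values, alpha=0.01, steps=5000):
    """For each c, lift the lower-right entry and record definiteness and growth."""
    rows = []
    for c in c_values:
        G = np.array(G0, dtype=float)
        G[-1, -1] += c  # lift the lower-right block from 1 to 1 + c
        lam_min = float(np.linalg.eigvalsh(0.5 * (G + G.T)).min())
        rows.append((c, lam_min, expected_iteration_norm(G, alpha, steps)))
    return rows

# Hypothetical symmetric toy matrix (NOT from the paper); indefinite at c = 0.
G0_toy = np.array([[1.0, 2.0],
                   [2.0, 1.0]])

for c, lam_min, final_norm in scan_c(G0_toy, [0.0, 2.0, 4.0, 8.0]):
    print(f"c={c:4.1f}  min eig={lam_min:+.3f}  final norm={final_norm:.3e}")
```

Because the toy matrix is symmetric, loss of positive definiteness and divergence coincide here. In general, positive definiteness is sufficient but not necessary for the expected iteration to be stable, so the two checks can disagree on nonsymmetric matrices.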

Figures

Figures reproduced from arXiv: 2605.04100 by Chaohui Wu, Chao Li, Guang Yang, Jinguo Ye, Shangdong Yang, Tianyu Liang, Wenhao Wang, Xingguo Chen.

**Figure 1.** A new two-state counterexample for CETD.

**Figure 2.** Main diagnostic comparisons at α = 0.01. Panel (a) shows the geometry diagnostic on Boyan chain; panel (b) shows the off-policy stability diagnostic on Baird, where ETD and TETD trajectories are numerically unusable at this stepsize and are reported in …

**Figure 3.** Complete algorithm comparisons at α = 0.01 across all seven environments. The main text retains only the Boyan-chain and Baird panels; the remaining panels are reported here so that the main-text selection can be verified against the full coverage.

**Figure 4.** Algorithm comparisons at α = 0.005. The smaller stepsize compresses the differences between methods while preserving their ordering relative to …

**Figure 5.** Algorithm comparisons at α = 0.05. The larger stepsize amplifies the sensitivity of emphatic-trace methods and makes the stability gap between RETD and the naive emphatic methods more visible.

**Figure 6.** RETD c-scan at fixed α = 0.01. Small c approaches CETD; very large c damps the auxiliary recursion and approaches ETD; intermediate values deliver stable centered emphatic learning.

**Figure 7.** RETD learning-rate scan at the environment-specific regularization values used in the main …

Abstract

Off-policy temporal-difference (TD) learning with function approximation faces a structural tradeoff among stability, projection geometry, and variance control. Emphatic TD (ETD) improves the off-policy projection geometry through follow-on emphasis, but the follow-on trace can have high variance. We revisit this tradeoff through Bellman-error centering. Although centering naturally removes a common drift term from TD errors, we show that a naive centered emphatic extension introduces an auxiliary coupling that can destroy the positive-definiteness of the ETD key matrix. We propose Regularized Emphatic Temporal-Difference Learning (RETD), which preserves the follow-on trace and regularizes only the auxiliary centering recursion, corresponding to lifting the lower-right block of the coupled key matrix from 1 to 1+c. We derive the RETD core matrix, prove convergence under a conservative sufficient regularization condition, and evaluate the method on diagnostic linear off-policy prediction tasks. The experiments show that RETD avoids the instability of naive centered emphatic learning, preserves favorable emphatic geometry, and exhibits a robust intermediate regime for the regularization parameter c across the diagnostics.
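For readers unfamiliar with the follow-on trace mentioned in the abstract, the sketch below shows the standard ETD(0) update with unit interest and a constant discount, as introduced by Sutton, Mahmood, and White. It is background only: the class name `ETD0` and its interface are hypothetical, and the centering recursion and the regularizer c of RETD are not reproduced because the source does not state their update rules.

```python
import numpy as np

class ETD0:
    """Standard linear ETD(0) with unit interest (background sketch, not RETD)."""

    def __init__(self, n_features, alpha=0.01, gamma=0.9):
        self.theta = np.zeros(n_features)
        self.alpha = alpha
        self.gamma = gamma
        self.F = 0.0         # follow-on trace
        self.rho_prev = 0.0  # zero so that the first trace value equals 1

    def update(self, phi, reward, phi_next, rho):
        """One transition; rho is the importance ratio pi(a|s) / mu(a|s)."""
        phi = np.asarray(phi, dtype=float)
        phi_next = np.asarray(phi_next, dtype=float)
        # Follow-on trace: F_t = gamma * rho_{t-1} * F_{t-1} + 1
        self.F = self.gamma * self.rho_prev * self.F + 1.0
        # TD error under the current weights
        delta = reward + self.gamma * self.theta @ phi_next - self.theta @ phi
        # Emphasis-weighted, importance-corrected TD step
        self.theta = self.theta + self.alpha * self.F * rho * delta * phi
        self.rho_prev = rho
        return float(delta)

agent = ETD0(n_features=2, alpha=0.1, gamma=0.9)
agent.update(phi=[1.0, 0.0], reward=1.0, phi_next=[0.0, 1.0], rho=2.0)
print(agent.F, agent.theta)
```

The trace F multiplies products of importance ratios over time, which is the source of the high variance the abstract refers to.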

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RETD adds a precise regularization only to the auxiliary centering recursion in emphatic TD, restoring matrix definiteness with a convergence proof under a conservative c condition, but leaves the bound's tightness and practical selection uncharacterized.

The core contribution is a clean separation: keep the full follow-on trace for emphasis, but regularize only the centering auxiliary recursion by lifting its block from 1 to 1+c. This avoids the definiteness loss that naive centering creates in the ETD key matrix. The authors derive the resulting core matrix explicitly and prove convergence under a stated sufficient condition on c. That construction is new relative to prior ETD and centering papers, and the diagnostic linear off-policy prediction experiments show the method stays stable while retaining emphatic geometry in an intermediate c regime. Those pieces are useful and internally consistent on their own terms.

Referee Report

1 major / 2 minor

Summary. The paper proposes Regularized Emphatic Temporal-Difference Learning (RETD), which preserves the follow-on trace of Emphatic TD while regularizing only the auxiliary centering recursion. This corresponds to lifting the lower-right block of the coupled key matrix from 1 to 1+c. The authors derive the RETD core matrix, prove convergence under a conservative sufficient condition on the regularization parameter c, and evaluate the method on diagnostic linear off-policy prediction tasks. Experiments indicate that RETD avoids the instability of naive centered emphatic learning, preserves favorable emphatic geometry, and exhibits a robust intermediate regime for c.

Significance. If the convergence result holds and the regularization can be applied without eroding the emphatic projection benefits, this work would usefully resolve a stability-geometry tradeoff in off-policy TD with function approximation. The explicit derivation of the core matrix and the convergence proof under the stated condition are clear theoretical strengths; the diagnostic experiments help isolate the effect of c. The conservative sufficient condition, however, leaves the practical tightness and geometry preservation in the reported regime incompletely secured.

major comments (1)

[Convergence proof (abstract and §4)] The convergence proof (described in the abstract as relying on a 'conservative sufficient regularization condition' on c) does not characterize the minimal c that restores positive-definiteness of the key matrix or verify that the intermediate-c values used in the experiments satisfy this minimal requirement while retaining emphatic geometry. Because the abstract claims robustness in that regime, this gap is load-bearing for the central guarantee.
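One way to probe the gap raised in this comment is to compute the exact lift at which a matrix with the stated block shape first becomes positive definite, and compare it with any candidate c. The sketch below does that under strong assumptions: the block names `A`, `b`, `d` and the function `minimal_lift` are hypothetical, the upper-left block must already have a positive definite symmetric part, and the result is a generic Schur-complement threshold, not the paper's conservative sufficient condition.

```python
import numpy as np

def minimal_lift(A, b, d):
    """Infimum of c for which [[A, b], [d^T, 1 + c]] satisfies x^T G x > 0.

    Assumes the symmetric part of A is positive definite. A negative
    return value means the unlifted matrix (c = 0) is already definite.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).ravel()
    d = np.asarray(d, dtype=float).ravel()
    A_s = 0.5 * (A + A.T)
    if np.linalg.eigvalsh(A_s).min() <= 0:
        raise ValueError("symmetric part of A must be positive definite")
    u = 0.5 * (b + d)  # off-diagonal block of the symmetric part
    # Schur complement: need 1 + c > u^T A_s^{-1} u
    return float(u @ np.linalg.solve(A_s, u) - 1.0)

# Hypothetical 1x1 upper-left block with couplings b = 3, d = 0 (NOT from the paper).
print(minimal_lift([[1.0]], [3.0], [0.0]))
```

Any c strictly above the returned value makes the symmetric part positive definite for that matrix; how far a conservative sufficient bound sits above it is exactly the tightness question the comment raises.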

minor comments (2)

[Abstract] The abstract refers to 'diagnostic linear off-policy prediction tasks' without naming the specific tasks, state representations, or performance metrics; adding these details would improve reproducibility and clarity.
[§3 (matrix derivation)] The notation for the coupled key matrix and the block-lifting operation would benefit from an explicit equation (e.g., Eq. (X) showing the 2x2 block structure) at the first mention in the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of tightening the convergence analysis. We address the major comment below and will revise the manuscript accordingly to strengthen the connection between the theoretical guarantee and the experimental regime.

Referee: The convergence proof (described in the abstract as relying on a 'conservative sufficient regularization condition' on c) does not characterize the minimal c that restores positive-definiteness of the key matrix or verify that the intermediate-c values used in the experiments satisfy this minimal requirement while retaining emphatic geometry. Because the abstract claims robustness in that regime, this gap is load-bearing for the central guarantee.

Authors: The proof in Section 4 derives a sufficient condition on c that ensures the key matrix remains positive definite, which is conservative by design to allow a clean proof. We do not claim this condition is minimal, and indeed characterizing the exact minimal c for positive-definiteness would require a more refined analysis of the matrix eigenvalues, which is left for future work. For the experiments, the intermediate values of c were chosen based on empirical stability, and the results show that RETD maintains the emphatic projection benefits without instability. To address the concern, we will add to the revised manuscript an explicit computation of the sufficient threshold for the diagnostic tasks and confirm that the reported c values exceed it, while preserving the geometry as evidenced by the performance metrics. We will also qualify the abstract to note that robustness is observed in the regime satisfying the sufficient condition.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation introduces explicit regularization with independent convergence proof

The paper starts from the known ETD key matrix and the centering recursion, shows that naive centering can destroy positive-definiteness, then deliberately lifts only the auxiliary block by the scalar c. It derives the resulting RETD core matrix from the modified recursion and proves convergence under an explicitly conservative sufficient condition on c. None of these steps reduces a claimed prediction or theorem to a fitted quantity or to a self-citation by construction; the regularization parameter is an independent design choice whose effect on the matrix is stated algebraically and whose convergence guarantee is proved from the modified equations. The derivation therefore remains self-contained.
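As background for the starting point of this derivation, the sketch below computes the ETD(0) key matrix on the well-known two-state example from the emphatic TD literature: features 1 and 2, a target policy that always moves to the second state, a uniform behavior policy, and discount 0.9. The construction is standard ETD with unit interest. It does not include the centering recursion or the regularizer c, and the variable names are chosen here for illustration.

```python
import numpy as np

gamma = 0.9
Phi = np.array([[1.0], [2.0]])          # one feature per state
P_pi = np.array([[0.0, 1.0],            # target policy: always go to state 2
                 [0.0, 1.0]])
d_mu = np.array([0.5, 0.5])             # stationary distribution of the behavior policy

I = np.eye(2)

# Off-policy TD(0) key matrix: Phi^T D_mu (I - gamma P_pi) Phi
A_td = Phi.T @ np.diag(d_mu) @ (I - gamma * P_pi) @ Phi

# Expected follow-on weighting: f = (I - gamma P_pi^T)^{-1} d_mu
f = np.linalg.solve(I - gamma * P_pi.T, d_mu)

# ETD(0) key matrix: Phi^T diag(f) (I - gamma P_pi) Phi
A_etd = Phi.T @ np.diag(f) @ (I - gamma * P_pi) @ Phi

print(A_td.item(), A_etd.item())  # approximately -0.2 and 3.4
```

The negative scalar for plain off-policy TD(0) signals instability, while the emphatic weighting makes it positive. This is the property that, according to the rationale above, naive centering can destroy and the lift by c is meant to restore.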

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on standard MDP ergodicity and bounded-feature assumptions for TD convergence plus the introduction of a single tunable regularization parameter c whose value must satisfy a sufficient condition derived in the paper.

free parameters (1)

regularization parameter c
Positive constant that lifts the lower-right block of the coupled key matrix from 1 to 1+c to restore positive-definiteness.

axioms (1)

Domain assumption: standard assumptions for linear TD convergence (ergodicity of the Markov chain under the behavior policy and bounded feature vectors).
Invoked to establish convergence of the RETD iterates under the sufficient condition on c.
