Regularized Centered Emphatic Temporal Difference Learning
Pith reviewed 2026-05-09 14:27 UTC · model grok-4.3
The pith
Regularized emphatic TD learning stabilizes off-policy updates by regularizing only the auxiliary centering recursion while preserving the follow-on trace.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The core discovery is that Regularized Emphatic Temporal-Difference Learning (RETD) preserves the follow-on trace and regularizes only the auxiliary centering recursion, which corresponds to lifting the lower-right block of the coupled key matrix from 1 to 1+c. The derived RETD core matrix remains positive definite under a conservative sufficient condition on c, the method converges under that condition, and it retains favorable emphatic geometry on linear off-policy prediction tasks.
What carries the argument
The RETD core matrix, formed by lifting the lower-right block of the coupled key matrix from 1 to 1+c. This regularizes the auxiliary centering recursion while leaving the follow-on trace untouched.
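The block lifting can be written schematically as below. This is only an illustrative sketch: the summary does not give the contents of the blocks, so the symbols K, A, u, and v are placeholder names introduced here, and the lower-right block is assumed to be a scalar because the text describes it as moving from 1 to 1+c.

```latex
% Schematic only: A, u, v are placeholders, not the paper's notation.
\[
K(0) =
\begin{pmatrix}
A & u \\
v^{\top} & 1
\end{pmatrix}
\quad\longrightarrow\quad
K(c) =
\begin{pmatrix}
A & u \\
v^{\top} & 1 + c
\end{pmatrix}
\]
```

Only the bottom-right entry changes, which is how the regularization can act on the auxiliary centering recursion without touching the rest of the coupled system.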
If this is right
- RETD converges under the conservative sufficient regularization condition on c.
- RETD avoids the instability observed in naive centered emphatic learning.
- RETD preserves the favorable projection geometry of emphatic methods on linear off-policy prediction tasks.
- An intermediate range of c yields robust performance across the diagnostic tasks.
Where Pith is reading between the lines
- The selective regularization pattern may extend to other TD variants that combine emphasis with auxiliary recursions.
- Practical selection of c could be guided by monitoring matrix eigenvalues during learning rather than relying solely on the conservative bound.
- The approach suggests a general template for stabilizing coupled linear recursions by regularizing only the destabilizing block.
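The eigenvalue-monitoring idea in the list above could be sketched as follows. A real, possibly non-symmetric matrix M satisfies x^T M x > 0 for every nonzero x exactly when the smallest eigenvalue of its symmetric part (M + M^T)/2 is positive, so that single number can serve as a diagnostic. The function names and the 2x2 `toy_key` matrix are hypothetical stand-ins and are not taken from the paper.

```python
import numpy as np


def lift_lower_right(key_matrix, c):
    """Return a copy of the matrix with its bottom-right entry increased by c."""
    lifted = np.array(key_matrix, dtype=float, copy=True)
    lifted[-1, -1] += c
    return lifted


def min_symmetric_eigenvalue(matrix):
    """Smallest eigenvalue of the symmetric part (M + M^T) / 2.

    The quadratic form x^T M x is positive for all nonzero real x
    exactly when this value is positive.
    """
    m = np.asarray(matrix, dtype=float)
    return float(np.linalg.eigvalsh(0.5 * (m + m.T)).min())


# Hypothetical toy matrix with lower-right entry 1 (an assumption for
# illustration, NOT a matrix from the paper).
toy_key = np.array([[1.0, 3.0],
                    [0.0, 1.0]])

for c in (0.0, 1.0, 2.0):
    value = min_symmetric_eigenvalue(lift_lower_right(toy_key, c))
    status = "positive definite" if value > 0 else "not positive definite"
    print(f"c={c:.1f}: min symmetric eigenvalue {value:+.3f} ({status})")
```

In an online setting the same check would have to run on an estimate of the matrix rather than the exact one, which this sketch does not address.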
Load-bearing premise
A conservative sufficient condition on the regularization parameter c is strong enough to guarantee positive-definiteness and convergence without erasing the geometric benefits of emphasis.
What would settle it
Run the diagnostic linear off-policy tasks with values of c below the stated sufficient condition and check whether the key matrix loses positive definiteness or the iterates diverge.
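A minimal sketch of such a check, on a toy problem rather than the paper's diagnostics, is given below. It iterates the noise-free linear recursion w <- w - alpha * M w for several values of c and reports both the smallest eigenvalue of the symmetric part and the final iterate norm. The matrix `toy_core`, the step size, and the chosen c values are all assumptions made for illustration.

```python
import numpy as np


def toy_core(c):
    """Hypothetical 2x2 coupled matrix with lower-right entry 1 + c."""
    return np.array([[1.0, 3.0],
                     [1.0, 1.0 + c]])


def min_symmetric_eigenvalue(matrix):
    """Smallest eigenvalue of (M + M^T) / 2; positive means positive definite."""
    m = np.asarray(matrix, dtype=float)
    return float(np.linalg.eigvalsh(0.5 * (m + m.T)).min())


def final_iterate_norm(core_matrix, step_size=0.05, num_steps=2000):
    """Run w <- w - step_size * M w from w = (1, ..., 1) and return ||w||."""
    m = np.asarray(core_matrix, dtype=float)
    w = np.ones(m.shape[0])
    for _ in range(num_steps):
        w = w - step_size * (m @ w)
    return float(np.linalg.norm(w))


for c in (0.0, 2.5, 4.0):
    m = toy_core(c)
    print(f"c={c:.1f}: min symmetric eigenvalue "
          f"{min_symmetric_eigenvalue(m):+.3f}, "
          f"final norm {final_iterate_norm(m):.3e}")
```

In this toy case the iterates blow up at c = 0, contract at c = 4 where the matrix is positive definite, and also contract at c = 2.5 where it is not. That is consistent with positive definiteness being a sufficient rather than necessary condition for stability of a linear recursion, and it says nothing about where the paper's own threshold lies.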
Original abstract
Off-policy temporal-difference (TD) learning with function approximation faces a structural tradeoff among stability, projection geometry, and variance control. Emphatic TD (ETD) improves the off-policy projection geometry through follow-on emphasis, but the follow-on trace can have high variance. We revisit this tradeoff through Bellman-error centering. Although centering naturally removes a common drift term from TD errors, we show that a naive centered emphatic extension introduces an auxiliary coupling that can destroy the positive-definiteness of the ETD key matrix. We propose Regularized Emphatic Temporal-Difference Learning (RETD), which preserves the follow-on trace and regularizes only the auxiliary centering recursion, corresponding to lifting the lower-right block of the coupled key matrix from 1 to 1+c. We derive the RETD core matrix, prove convergence under a conservative sufficient regularization condition, and evaluate the method on diagnostic linear off-policy prediction tasks. The experiments show that RETD avoids the instability of naive centered emphatic learning, preserves favorable emphatic geometry, and exhibits a robust intermediate regime for the regularization parameter c across the diagnostics.
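For readers unfamiliar with the follow-on trace mentioned in the abstract, the sketch below computes it in the standard ETD(0) form with constant discount and interest, F_t = gamma * rho_{t-1} * F_{t-1} + i with F_0 = i. This is the usual emphatic TD recursion, not the RETD update, whose exact form is not given in this summary. The function name and the sample importance ratios are hypothetical.

```python
def follow_on_trace(rhos, gamma, interest=1.0):
    """Follow-on trace of standard ETD(0) with constant discount and interest.

    rhos[t] is the importance sampling ratio pi(a_t | s_t) / mu(a_t | s_t).
    F_0 = interest and F_t = gamma * rhos[t - 1] * F_{t - 1} + interest.
    """
    trace = []
    f = interest
    for t in range(len(rhos)):
        if t > 0:
            f = gamma * rhos[t - 1] * f + interest
        trace.append(f)
    return trace


# Hypothetical ratios: a run of large ratios inflates the trace, and a zero
# ratio resets it to the interest on the next step.
print(follow_on_trace([2.0, 2.0, 2.0, 0.0, 2.0], gamma=0.9))
```

Because each step multiplies the previous trace by an importance ratio, runs of large ratios compound, which is one way to see why the trace can have high variance.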
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Regularized Emphatic Temporal-Difference Learning (RETD), which preserves the follow-on trace of Emphatic TD while regularizing only the auxiliary centering recursion. This corresponds to lifting the lower-right block of the coupled key matrix from 1 to 1+c. The authors derive the RETD core matrix, prove convergence under a conservative sufficient condition on the regularization parameter c, and evaluate the method on diagnostic linear off-policy prediction tasks. Experiments indicate that RETD avoids the instability of naive centered emphatic learning, preserves favorable emphatic geometry, and exhibits a robust intermediate regime for c.
Significance. If the convergence result holds and the regularization can be applied without eroding the emphatic projection benefits, this work would usefully resolve a stability-geometry tradeoff in off-policy TD with function approximation. The explicit derivation of the core matrix and the convergence proof under the stated condition are clear theoretical strengths; the diagnostic experiments help isolate the effect of c. Because the sufficient condition is conservative, however, it remains unclear how tight it is in practice and whether emphatic geometry is preserved at the values of c reported in the experiments.
major comments (1)
- [Convergence proof (abstract and §4)] The convergence proof (described in the abstract as relying on a 'conservative sufficient regularization condition' on c) does not characterize the minimal c that restores positive-definiteness of the key matrix or verify that the intermediate-c values used in the experiments satisfy this minimal requirement while retaining emphatic geometry. Because the abstract claims robustness in that regime, this gap is load-bearing for the central guarantee.
minor comments (2)
- [Abstract] The abstract refers to 'diagnostic linear off-policy prediction tasks' without naming the specific tasks, state representations, or performance metrics; adding these details would improve reproducibility and clarity.
- [§3 (matrix derivation)] The notation for the coupled key matrix and the block-lifting operation would benefit from an explicit equation (e.g., Eq. (X) showing the 2x2 block structure) at the first mention in the main text.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting the importance of tightening the convergence analysis. We address the major comment below and will revise the manuscript accordingly to strengthen the connection between the theoretical guarantee and the experimental regime.
point-by-point responses
- Referee: The convergence proof (described in the abstract as relying on a 'conservative sufficient regularization condition' on c) does not characterize the minimal c that restores positive-definiteness of the key matrix or verify that the intermediate-c values used in the experiments satisfy this minimal requirement while retaining emphatic geometry. Because the abstract claims robustness in that regime, this gap is load-bearing for the central guarantee.
Authors: The proof in Section 4 derives a sufficient condition on c that ensures the key matrix remains positive definite, which is conservative by design to allow a clean proof. We do not claim this condition is minimal, and indeed characterizing the exact minimal c for positive-definiteness would require a more refined analysis of the matrix eigenvalues, which is left for future work. For the experiments, the intermediate values of c were chosen based on empirical stability, and the results show that RETD maintains the emphatic projection benefits without instability. To address the concern, we will add to the revised manuscript an explicit computation of the sufficient threshold for the diagnostic tasks and confirm that the reported c values exceed it, while preserving the geometry as evidenced by the performance metrics. We will also qualify the abstract to note that robustness is observed in the regime satisfying the sufficient condition.
Revision: partial
Circularity Check
No significant circularity; derivation introduces explicit regularization with independent convergence proof
The paper starts from the known ETD key matrix and the centering recursion, shows that naive centering can destroy positive-definiteness, then deliberately lifts only the auxiliary block by the scalar c. It derives the resulting RETD core matrix from the modified recursion and proves convergence under an explicitly conservative sufficient condition on c. None of these steps reduces a claimed prediction or theorem to a fitted quantity or to a self-citation by construction; the regularization parameter is an independent design choice whose effect on the matrix is stated algebraically and whose convergence guarantee is proved from the modified equations. The derivation therefore remains self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization parameter c
axioms (1)
- domain assumption: standard assumptions for linear TD convergence (ergodicity of the Markov chain under the behavior policy and bounded feature vectors)