Deep Reinforcement Learning and the Deadly Triad

Florian Strub; Hado van Hasselt; Joseph Modayil; Matteo Hessel; Nicolas Sonnerat; Yotam Doron

arxiv: 1812.02648 · v1 · pith:ADY4X6ARnew · submitted 2018-12-06 · 💻 cs.AI · cs.LG

Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, Joseph Modayil

keywords: learning, deadly, triad, deep, reinforcement, properties, three, agent

We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge, with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties, which indicates that there is at least a partial gap in our understanding. In this work, we investigate the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models (deep Q-networks trained with experience replay), analysing how the components of this system play a role in the emergence of the deadly triad, and in the agent's performance.
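The divergence the abstract refers to can be illustrated with a small textbook-style example. It is not taken from this paper's experiments, and it uses linear function approximation rather than a deep Q-network. The sketch assumes two states whose values are approximated as w and 2w with a single shared weight (function approximation), a target built from the next state's estimate (bootstrapping), and updates applied only to the transition from the first state to the second (an off-policy update distribution). The function name, step size, and discount values are illustrative assumptions.

```python
def semi_gradient_td_two_state(w0=1.0, alpha=0.1, gamma=0.99, steps=100):
    """Repeatedly apply semi-gradient TD(0) to one transition, s1 -> s2.

    Assumed setup (hypothetical, for illustration only):
      - function approximation: v(s1) = 1 * w, v(s2) = 2 * w
      - bootstrapping: the target uses the current estimate v(s2)
      - off-policy: only the s1 -> s2 transition (reward 0) is ever updated
    """
    w = w0
    history = [w]
    for _ in range(steps):
        # TD error: reward + gamma * v(s2) - v(s1)
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        # The gradient of v(s1) with respect to w is the feature value 1.
        w += alpha * td_error * 1.0
        history.append(w)
    return history


# Each step multiplies w by (1 + alpha * (2 * gamma - 1)),
# so w grows without bound whenever gamma > 0.5.
diverging = semi_gradient_td_two_state(gamma=0.99)
shrinking = semi_gradient_td_two_state(gamma=0.4)

print("final w with gamma=0.99:", diverging[-1])
print("final w with gamma=0.40:", shrinking[-1])
```

In this sketch the weight grows geometrically for the larger discount and decays toward zero for the smaller one. It shows only that the three properties together can produce unbounded estimates, not that they always do.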

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing
stat.ML 2026-05 unverdicted novelty 7.0

Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.
Replicable Reinforcement Learning with Linear Function Approximation
cs.LG 2025-09 unverdicted novelty 7.0

Introduces replicable random design regression and covariance estimation tools to enable the first provably efficient replicable RL algorithms for linear MDPs in generative and episodic settings.
Behavior-Consistent Deep Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.
Behavior-Consistent Deep Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

QED sets state-dependent temperature proportional to double-critic disagreement to bound pairwise KL divergence between Boltzmann policies, cutting cross-run divergence by two orders of magnitude on 18 continuous-cont...
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
AdamO: A Collapse-Suppressed Optimizer for Offline RL
cs.LG 2026-05 unverdicted novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
cs.LG 2025-10 unverdicted novelty 6.0

MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.
Behavior Regularized Offline Reinforcement Learning
cs.LG 2019-11 unverdicted novelty 6.0

Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
Deep Double Q-learning
cs.LG 2025-06 unverdicted novelty 5.0

Deep Double Q-learning explicitly trains two Q-functions in deep RL, outperforming Double DQN on 47 of 57 Atari games while further reducing overestimation.
Plasticity Loss in Deep Reinforcement Learning: A Survey
cs.AI 2024-11 unverdicted novelty 4.0

Survey unifies the definition of plasticity loss in DRL, taxonomizes over 50 mitigations, identifies evaluation gaps, and finds general regularization often outperforms domain-specific methods.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
cs.LG 2020-05 unverdicted novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.