arxiv: 2507.00275 · v2 · pith:RJJBUJQFnew · submitted 2025-06-30 · 💻 cs.LG · cs.AI

Deep Double Q-learning

Pith reviewed 2026-05-22 00:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords deep reinforcement learning · double q-learning · overestimation bias · atari games · q-functions · target networks · replay ratio

The pith

Deep Double Q-learning explicitly trains two Q-functions to decouple selection from evaluation and reduce overestimation in deep RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Deep Double Q-learning to fully adapt classical Double Q-learning to deep reinforcement learning. It trains two independent action-value functions so that action selection and evaluation are decoupled when forming bootstrap targets, unlike Double DQN, which trains only one function and leaves the estimators correlated. The authors stabilize the dual training by lowering replay ratios, lengthening target network update intervals, and sharing layers between the two functions. Across 57 Atari 2600 games this produces higher aggregate performance than Double DQN while further cutting overestimation.

Core claim

Deep Double Q-learning explicitly trains two Q-functions through Double Q-learning and decouples action-selection from action-evaluation in the bootstrap targets. Training is stabilized through lower replay ratios, longer target network update intervals, and shared layers, which together reduce overestimation and raise performance relative to Double DQN on Atari 2600 games.
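
A minimal sketch of the decoupling described above, assuming next-state Q-values are already available as NumPy arrays. The function name `decoupled_target` and all numbers are illustrative and do not come from the paper. Whether DDQL selects with online or target networks is not stated here, so the sketch stays agnostic about that. It follows the classical Double Q-learning convention: the function being updated selects the action and the other function evaluates it.

```python
import numpy as np

def decoupled_target(q_select, q_eval, rewards, dones, gamma=0.99):
    """Bootstrap target where one estimator picks the action and another scores it.

    q_select, q_eval: arrays of shape (batch, num_actions) holding next-state
    action values from the selecting and the evaluating estimator.
    """
    best_actions = np.argmax(q_select, axis=1)       # action-selection
    batch_idx = np.arange(q_eval.shape[0])
    bootstrap = q_eval[batch_idx, best_actions]      # action-evaluation
    # Terminal transitions (dones == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * bootstrap

# Made-up next-state values for a batch of 2 transitions and 3 actions.
q1_next = np.array([[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]])
q2_next = np.array([[0.8, 1.5, 2.5], [0.2, 0.1, 2.0]])
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])

# Reciprocal bootstrapping: each Q-function's target is evaluated by the other.
target_for_q1 = decoupled_target(q1_next, q2_next, rewards, dones, gamma=0.9)
target_for_q2 = decoupled_target(q2_next, q1_next, rewards, dones, gamma=0.9)
```

Under the same assumptions, Double DQN's target corresponds to calling `decoupled_target` with the online network's values for selection and the target network's values for evaluation, so only one function is actually trained.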

What carries the argument

Two independent Q-functions that decouple action-selection from action-evaluation when computing bootstrap targets, stabilized by adjusted replay and target-update schedules plus shared layers.

If this is right

  • DDQL outperforms Double DQN on 47 of the 57 Atari games while lowering overestimation further.
  • Lower replay ratios and longer target-update intervals are required to keep the two estimators stable.
  • Shared layers between the two Q-functions help avoid new instabilities during dual training.
  • Minibatch sampling strategies and network architecture choices matter for successful adaptation of Double Q-learning to deep RL.
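
The two schedule knobs in the list above can be sketched as a small generator. This is an illustration only: `training_schedule` is a hypothetical helper, the replay ratio is taken to mean gradient updates per environment step, the target interval is counted in gradient updates, and the values used are not the paper's hyperparameters.

```python
from fractions import Fraction

def training_schedule(total_env_steps, replay_ratio, target_update_interval):
    """Yield (env_step, do_update, do_target_sync) for a simple DQN-style loop.

    replay_ratio: gradient updates per environment step, assumed to be of the
        form 1/k (for example Fraction(1, 8) means one update every 8 steps).
    target_update_interval: number of gradient updates between target syncs.
    """
    steps_per_update = int(1 / Fraction(replay_ratio))
    num_updates = 0
    for env_step in range(1, total_env_steps + 1):
        do_update = env_step % steps_per_update == 0
        do_sync = False
        if do_update:
            num_updates += 1
            do_sync = num_updates % target_update_interval == 0
        yield env_step, do_update, do_sync

# Halving the replay ratio halves the number of updates for the same experience.
updates_quarter = sum(u for _, u, _ in training_schedule(32, Fraction(1, 4), 2))
updates_eighth = sum(u for _, u, _ in training_schedule(32, Fraction(1, 8), 2))
```

A lower replay ratio means each gradient update sees more fresh experience, and a longer target interval keeps the bootstrap targets fixed for more updates.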

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling principle could be tested in continuous-control or robotic domains where overestimation also appears.
  • Similar explicit separation of selection and evaluation might reduce bias in other deep RL methods such as actor-critic algorithms.
  • Extending the stabilizations to deeper or wider networks would test whether the approach scales beyond the Atari setting.

Load-bearing premise

The specific combination of lower replay ratios, longer target network update intervals, and shared layers will stabilize training of two independent Q-functions without reintroducing estimator correlations or new instabilities.
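
To make the shared-layer premise concrete, here is a hedged sketch of one possible reading: a shared trunk feeding two separate output heads. It is a small fully connected network in NumPy, not the convolutional Atari architecture, and every name and size (`init_two_head_q`, `forward`, the dimensions) is an assumption made for illustration.

```python
import numpy as np

def init_two_head_q(obs_dim, hidden_dim, num_actions, seed=0):
    """Create parameters for one shared trunk and two independent Q-heads."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "trunk_w": rng.normal(0.0, scale, (obs_dim, hidden_dim)),
        "trunk_b": np.zeros(hidden_dim),
        "head1_w": rng.normal(0.0, scale, (hidden_dim, num_actions)),
        "head1_b": np.zeros(num_actions),
        "head2_w": rng.normal(0.0, scale, (hidden_dim, num_actions)),
        "head2_b": np.zeros(num_actions),
    }

def forward(params, obs):
    """Return (q1, q2), each of shape (batch, num_actions)."""
    # Shared features: both heads read the same representation.
    features = np.maximum(0.0, obs @ params["trunk_w"] + params["trunk_b"])
    q1 = features @ params["head1_w"] + params["head1_b"]
    q2 = features @ params["head2_w"] + params["head2_b"]
    return q1, q2

params = init_two_head_q(obs_dim=8, hidden_dim=16, num_actions=4)
q1, q2 = forward(params, np.ones((5, 8)))
```

In a layout like this, gradients from both heads update the same trunk parameters, which is exactly where the premise could fail: the heads are separate, but their inputs are not.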

What would settle it

Running DDQL on the same 57 Atari games and still observing high overestimation or no aggregate improvement over Double DQN would falsify the central claim.

Figures

Figures reproduced from arXiv: 2507.00275 by Marlos C. Machado, Martha White, Prabhat Nagarajan.

Figure 1: Reciprocal bootstrapping. Each value function bootstraps the other value function. Double estimation with reciprocal bootstrapping. Double estimation, in addition to implementing target bootstrap decoupling, has additional requirements to further de-correlate action-selection and action-evaluation in the bootstrap target. In double estimation, two Q-functions are explicitly learned, with bootstrap targ…
Figure 2: Network architectures for DQN and DDQL variants.
Figure 3: Final overestimations averaged across five seeds of Double DQN, DH-DDQL, …
Figure 4: Human-normalized scores throughout training. Note that the scale of the y-axes …
Figure 5: Improvement in terms of HNS of DH-DDQL (…
Figure 6: Performance of DH-DDQL compared to DH-DDQL (double buffer). The algo…
Figure 7: Overestimation of DH-DDQL compared to DH-DDQL (double buffer). The …
Figure 8: Performance of DN-DDQL compared to DN-DDQL (double buffer). DN-DDQL …
Figure 9: Performance of DH-DDQL compared to DH-DDQL(RR = …
Figure 10: Performance of DN-DDQL compared to DN-DDQL(RR = …
Figure 11: Overestimation of DH-DDQL(RR = 1/4), DN-DDQL(RR = 1/4), and Double DQN. Overestimation is clipped at -8 due to divergence in BattleZone. The DDQL variants continue to reduce overestimation even with double the replay ratio. … that on a per-update basis, DDQL is more efficient than Double DQN at credit assignment. Moreover, these results indicate that DDQL benefits greatly from increased stationarity in th…
Figure 12: Overestimation of five algorithms on NameThisGame. Increased de-correlation reduces overestimation. Shaded region: 95% confidence interval over five seeds. Our results yield two key insights. The first, though expected, is that progressive de-correlation as per our three defining features generally reduces overestimation. Double DQN, which only implements target bootstrap decoupling, has the least am…
Figure 13: Final overestimations (across five seeds) of Double DQN, DH-DDQL, and DN…
Figure 14: Scores across 50M timesteps across 57 Atari 2600 games.
Figure 15: Overestimation across 50M timesteps across 57 Atari 2600 games.
Figure 16: A comparison of DN-DDQL with a short target network update interval to …
Figure 17: Performance of DH-DDQL compared to DH-DDQL …
Figure 18: Performance of DN-DDQL compared to DN-DDQL …
original abstract

Double Q-learning is a classical control algorithm that mitigates the maximization bias of Q-learning. To do so, it explicitly trains two independent action-value functions and uses them to decouple action-selection and action-evaluation when computing bootstrap targets. Double DQN adapts target bootstrap decoupling to deep reinforcement learning (RL), but explicitly trains only a single action-value function and does not fully decouple its estimators. Consequently, the two estimators remain correlated, and overestimation persists. In this paper, we introduce Deep Double Q-learning (DDQL), a deep RL algorithm that explicitly trains two Q-functions through Double Q-learning. DDQL stabilizes training through a combination of techniques, including lower replay ratios, longer target network update intervals, and shared layers. Across 57 Atari 2600 games, DDQL improves aggregate performance over Double DQN, outperforming it on 47 games while further reducing overestimation. In addition, we study key design choices when adapting Double Q-learning to deep RL, including the network architecture, replay ratio, and minibatch sampling strategies.
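
The maximization bias the abstract refers to can be reproduced in a few lines. This is a toy simulation, not an experiment from the paper: every action has a true value of 0, the noise model (unit Gaussian) and the sizes are arbitrary choices, and two independent sets of noisy estimates stand in for the two action-value functions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials, num_actions = 10_000, 10

# Every action has true value 0, so the true maximum is 0.
est_a = rng.normal(0.0, 1.0, size=(num_trials, num_actions))
est_b = rng.normal(0.0, 1.0, size=(num_trials, num_actions))

# Single estimator: select and evaluate with the same noisy values.
single = est_a.max(axis=1).mean()

# Double estimator: select with A, evaluate with the independent B.
chosen = est_a.argmax(axis=1)
double = est_b[np.arange(num_trials), chosen].mean()

print(f"single estimator bias: {single:.2f}")
print(f"double estimator bias: {double:.2f}")
```

The single estimator is biased upward because the maximum picks out whichever action happened to receive the largest positive noise. The double estimator removes that effect only to the extent that the two sets of errors are independent, which is why estimator correlation matters.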

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Deep Double Q-learning (DDQL), a deep RL adaptation of classical Double Q-learning that explicitly trains two action-value functions to decouple action selection from evaluation. DDQL employs stabilization techniques including lower replay ratios, longer target-network update intervals, and shared layers between the two Q-networks. On 57 Atari 2600 games, DDQL is reported to outperform Double DQN on 47 games with higher aggregate performance and further reduced overestimation; the paper also examines design choices such as network architecture, replay ratio, and minibatch sampling.

Significance. If the empirical gains prove robust, the work is significant for demonstrating that fuller realization of the Double Q-learning decoupling mechanism can yield measurable improvements over Double DQN in deep settings. The large-scale Atari evaluation and explicit study of stabilization hyperparameters provide practical guidance for mitigating maximization bias. The manuscript does not ship machine-checked proofs or parameter-free derivations, but the reproducible benchmarking protocol on a standard suite is a positive attribute.

major comments (2)
  1. [Network Architecture and Stabilization] Network Architecture and Stabilization section: the claim that shared layers plus lower replay ratio and longer target updates preserve sufficient estimator independence is load-bearing for the central decoupling argument, yet no direct measurement (e.g., correlation between the two Q-head outputs or gradient alignment statistics) or ablation removing the shared backbone is presented. Shared parameters allow gradients from both heads to update the same features, which risks reintroducing the very correlations Double Q-learning is intended to avoid.
  2. [Empirical Results] Empirical Results section (Atari evaluation): the reported 47/57 win rate and aggregate improvement lack error bars across random seeds, confidence intervals, or statistical significance tests. Without these, it is impossible to determine whether the observed gains are distinguishable from training stochasticity or hyperparameter sensitivity, weakening the claim that DDQL reliably outperforms Double DQN.
minor comments (2)
  1. [Abstract] The abstract states that DDQL 'further reduc[es] overestimation' but does not define the precise overestimation metric or show the corresponding plot or table reference.
  2. [Figures] Figure captions for learning curves should explicitly state the number of independent runs and whether shaded regions represent standard deviation or standard error.
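
Major comment 1 asks for a direct measurement of how correlated the two estimators are. One hedged way to sketch such a diagnostic is shown below; `head_agreement` is a hypothetical helper, not something the paper reports, and it assumes both Q-functions have been evaluated on the same batch of states.

```python
import numpy as np

def head_agreement(q1, q2):
    """Compare two Q-functions evaluated on the same states.

    q1, q2: arrays of shape (batch, num_actions).
    Returns the Pearson correlation of the flattened outputs and the
    fraction of states on which both functions pick the same greedy action.
    """
    pearson = float(np.corrcoef(q1.ravel(), q2.ravel())[0, 1])
    greedy_match = float(np.mean(q1.argmax(axis=1) == q2.argmax(axis=1)))
    return pearson, greedy_match

# Made-up values: the second function is a noisy copy of the first.
rng = np.random.default_rng(0)
q1 = rng.normal(size=(256, 6))
q2 = q1 + 0.5 * rng.normal(size=(256, 6))
pearson, greedy_match = head_agreement(q1, q2)
```

Raw outputs are expected to correlate strongly, since both functions approximate the same values. The quantity that matters for maximization bias is the correlation of their errors, so a fuller analysis would subtract a reference value estimate before computing the statistic.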

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on Deep Double Q-learning. We address each major comment below and describe the revisions we intend to incorporate.

point-by-point responses
  1. Referee: [Network Architecture and Stabilization] Network Architecture and Stabilization section: the claim that shared layers plus lower replay ratio and longer target updates preserve sufficient estimator independence is load-bearing for the central decoupling argument, yet no direct measurement (e.g., correlation between the two Q-head outputs or gradient alignment statistics) or ablation removing the shared backbone is presented. Shared parameters allow gradients from both heads to update the same features, which risks reintroducing the very correlations Double Q-learning is intended to avoid.

    Authors: We agree that direct measurements of estimator independence and an ablation with fully separate backbones would strengthen the central argument. Our design uses shared layers for computational efficiency and feature reuse while relying on separate output heads together with reduced replay ratios and extended target-update intervals to limit correlation; the observed further reduction in overestimation provides indirect support. Nevertheless, the absence of explicit correlation statistics or a no-shared-backbone ablation is a limitation. We will add both an analysis of Q-head output correlations and gradient alignment as well as the requested ablation study in the revised manuscript. revision: yes

  2. Referee: [Empirical Results] Empirical Results section (Atari evaluation): the reported 47/57 win rate and aggregate improvement lack error bars across random seeds, confidence intervals, or statistical significance tests. Without these, it is impossible to determine whether the observed gains are distinguishable from training stochasticity or hyperparameter sensitivity, weakening the claim that DDQL reliably outperforms Double DQN.

    Authors: We concur that reporting variability across random seeds and formal statistical comparisons would make the empirical claims more robust. The 47/57 win rate and aggregate scores were obtained from single runs per game, consistent with standard large-scale Atari reporting, yet this practice does leave the results vulnerable to seed-specific effects. We will rerun the full evaluation suite with multiple independent seeds, include error bars and confidence intervals, and add statistical significance tests between DDQL and Double DQN in the revised version. revision: yes
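
The confidence intervals discussed in the second exchange could be computed with a percentile bootstrap over per-seed scores. The sketch below is one common approach, not the authors' stated procedure; `bootstrap_ci` and the example scores are hypothetical.

```python
import numpy as np

def bootstrap_ci(scores, num_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-seed scores."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample seeds with replacement and record the mean of each resample.
    idx = rng.integers(0, len(scores), size=(num_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(scores.mean()), float(lo), float(hi)

# Hypothetical final scores from five seeds on one game.
mean, lo, hi = bootstrap_ci([1.0, 2.0, 3.0, 4.0, 5.0])
```

With only a handful of seeds the interval is coarse, so it is best read as a rough indication of seed-to-seed variability rather than a precise significance test.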

Circularity Check

0 steps flagged

No circularity: purely empirical algorithm proposal and benchmarking

full rationale

The paper introduces DDQL by adapting the classical Double Q-learning algorithm (which decouples selection and evaluation via two independent Q-functions) to deep networks, then stabilizes training with lower replay ratios, longer target network update intervals, and shared layers before reporting Atari 2600 results. No derivation, equation, or 'prediction' is shown to reduce to a fitted parameter or self-citation by construction. All performance claims rest on external benchmark comparisons (57 games) rather than internal self-reference. The skeptic's concern about shared layers reintroducing correlations is an assumption-validity issue, not a circularity issue. This is a standard empirical RL paper with independent external validation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the approach inherits standard RL assumptions about value estimation and adds design choices for stability. Limited information prevents exhaustive listing of all free parameters or axioms.

free parameters (2)
  • replay ratio
    Lower replay ratios chosen to stabilize training of two Q-functions.
  • target network update interval
    Longer intervals used as a stabilization technique.
axioms (1)
  • domain assumption: Explicitly training two independent action-value functions decouples selection and evaluation sufficiently to reduce overestimation in deep RL.
    Core premise drawn from classical Double Q-learning and applied here.

pith-pipeline@v0.9.0 · 5705 in / 1202 out tokens · 45813 ms · 2026-05-22T00:19:19.011491+00:00 · methodology
