Deep Double Q-learning

Marlos C. Machado; Martha White; Prabhat Nagarajan

arxiv: 2507.00275 · v2 · pith:RJJBUJQFnew · submitted 2025-06-30 · 💻 cs.LG · cs.AI

Deep Double Q-learning

Prabhat Nagarajan , Martha White , Marlos C. Machado This is my paper

Pith reviewed 2026-05-22 00:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords deep reinforcement learningdouble q-learningoverestimation biasatari gamesq-functionstarget networksreplay ratio

0 comments

The pith

Deep Double Q-learning explicitly trains two Q-functions to decouple selection from evaluation and reduce overestimation in deep RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Deep Double Q-learning to fully adapt classical Double Q-learning into deep reinforcement learning. It trains two independent action-value functions so that action selection and evaluation are decoupled when forming bootstrap targets, unlike Double DQN which trains only one function and leaves the estimators correlated. The authors stabilize the dual training by lowering replay ratios, lengthening target network update intervals, and sharing layers between the two functions. Across 57 Atari 2600 games this produces higher aggregate performance than Double DQN while further cutting overestimation.

Core claim

Deep Double Q-learning explicitly trains two Q-functions through Double Q-learning and decouples action-selection from action-evaluation in the bootstrap targets. Training is stabilized through lower replay ratios, longer target network update intervals, and shared layers, which together reduce overestimation and raise performance relative to Double DQN on Atari 2600 games.

What carries the argument

Two independent Q-functions that decouple action-selection from action-evaluation when computing bootstrap targets, stabilized by adjusted replay and target-update schedules plus shared layers.

If this is right

DDQL outperforms Double DQN on 47 of the 57 Atari games while lowering overestimation further.
Lower replay ratios and longer target-update intervals are required to keep the two estimators stable.
Shared layers between the two Q-functions help avoid new instabilities during dual training.
Minibatch sampling strategies and network architecture choices matter for successful adaptation of Double Q-learning to deep RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling principle could be tested in continuous-control or robotic domains where overestimation also appears.
Similar explicit separation of selection and evaluation might reduce bias in other deep RL methods such as actor-critic algorithms.
Extending the stabilizations to deeper or wider networks would test whether the approach scales beyond the Atari setting.

Load-bearing premise

The specific combination of lower replay ratios, longer target network update intervals, and shared layers will stabilize training of two independent Q-functions without reintroducing estimator correlations or new instabilities.

What would settle it

Running DDQL on the same 57 Atari games and still observing high overestimation or no aggregate improvement over Double DQN would falsify the central claim.

Figures

Figures reproduced from arXiv: 2507.00275 by Marlos C. Machado, Martha White, Prabhat Nagarajan.

**Figure 1.** Figure 1: Reciprocal bootstrapping. Each value function bootstraps the other value function. Double estimation with reciprocal bootstrapping Double estimation, in addition to implementing target bootstrap decoupling, has additional requirements to further decorrelate action-selection and action-evaluation in the bootstrap target. In double estimation, two Q-functions are explicitly learned, with bootstrap targ… view at source ↗

**Figure 2.** Figure 2: Network architectures for DQN and DDQL variants. [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗

**Figure 3.** Figure 3: Final overestimations averaged across five seeds of Double DQN, DH-DDQL, [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Human-normalized scores throughout training. Note that the scale of the y-axes [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Improvement in terms of HNS of DH-DDQL ( [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Performance of DH-DDQL compared to DH-DDQL (double buffer). The algo [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Overestimation of DH-DDQL compared to DH-DDQL (double buffer). The [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Performance of DN-DDQL compared to DN-DDQL (double buffer). DN-DDQL [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Performance of DH-DDQL compared to DH-DDQL(RR = [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Performance of DN-DDQL compared to DN-DDQL(RR = [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Overestimation of DH-DDQL(RR = 1 4 ), DN-DDQL(RR = 1 4 ), and Double DQN. Overestimation is clipped at -8 due to divergence in BattleZone. The DDQL variants continue to reduce overestimation even with double the replay ratio. that on a per-update basis, DDQL is more efficient than Double DQN at credit assignment. Moreover, these results indicate that DDQL benefits greatly from increased stationarity in th… view at source ↗

**Figure 12.** Figure 12: Overestimation of five algorithms on NameThisGame. Increased de-correlation reduces overestimation. Shaded region: 95% confidence interval over five seeds. Our results yields two key insights. The first, though expected, is that progressive de-correlation as per our three defining features generally reduces overestimation. Double DQN, which only implements target bootstrap decoupling, has the least am… view at source ↗

**Figure 13.** Figure 13: Final overestimations (across five seeds) of Double DQN, DH-DDQL, and DN [PITH_FULL_IMAGE:figures/full_fig_p033_13.png] view at source ↗

**Figure 14.** Figure 14: Scores across 50M timesteps across 57 Atari 2600 games. [PITH_FULL_IMAGE:figures/full_fig_p036_14.png] view at source ↗

**Figure 15.** Figure 15: Overestimation across 50M timesteps across 57 Atari 2600 games. [PITH_FULL_IMAGE:figures/full_fig_p038_15.png] view at source ↗

**Figure 16.** Figure 16: A comparison of DN-DDQL with a short target network update interval to [PITH_FULL_IMAGE:figures/full_fig_p041_16.png] view at source ↗

**Figure 17.** Figure 17: Performance of DH-DDQL compared to DH-DDQL [PITH_FULL_IMAGE:figures/full_fig_p043_17.png] view at source ↗

**Figure 18.** Figure 18: Performance of DN-DDQL compared to DN-DDQL [PITH_FULL_IMAGE:figures/full_fig_p043_18.png] view at source ↗

read the original abstract

Double Q-learning is a classical control algorithm that mitigates the maximization bias of Q-learning. To do so, it explicitly trains two independent action-value functions and uses them to decouple action-selection and action-evaluation when computing bootstrap targets. Double DQN adapts target bootstrap decoupling to deep reinforcement learning (RL), but explicitly trains only a single action-value function and does not fully decouple its estimators. Consequently, the two estimators remain correlated, and overestimation persists. In this paper, we introduce Deep Double Q-learning (DDQL), a deep RL algorithm that explicitly trains two Q-functions through Double Q-learning. DDQL stabilizes training through a combination of techniques, including lower replay ratios, longer target network update intervals, and shared layers. Across 57 Atari 2600 games, DDQL improves aggregate performance over Double DQN, outperforming it on 47 games while further reducing overestimation. In addition, we study key design choices when adapting Double Q-learning to deep RL, including the network architecture, replay ratio, and minibatch sampling strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Deep Double Q-learning (DDQL), a deep RL adaptation of classical Double Q-learning that explicitly trains two action-value functions to decouple action selection from evaluation. DDQL employs stabilization techniques including lower replay ratios, longer target-network update intervals, and shared layers between the two Q-networks. On 57 Atari 2600 games, DDQL is reported to outperform Double DQN on 47 games with higher aggregate performance and further reduced overestimation; the paper also examines design choices such as network architecture, replay ratio, and minibatch sampling.

Significance. If the empirical gains prove robust, the work is significant for demonstrating that fuller realization of the Double Q-learning decoupling mechanism can yield measurable improvements over Double DQN in deep settings. The large-scale Atari evaluation and explicit study of stabilization hyperparameters provide practical guidance for mitigating maximization bias. The manuscript does not ship machine-checked proofs or parameter-free derivations, but the reproducible benchmarking protocol on a standard suite is a positive attribute.

major comments (2)

[Network Architecture and Stabilization] Network Architecture and Stabilization section: the claim that shared layers plus lower replay ratio and longer target updates preserve sufficient estimator independence is load-bearing for the central decoupling argument, yet no direct measurement (e.g., correlation between the two Q-head outputs or gradient alignment statistics) or ablation removing the shared backbone is presented. Shared parameters allow gradients from both heads to update the same features, which risks reintroducing the very correlations Double Q-learning is intended to avoid.
[Empirical Results] Empirical Results section (Atari evaluation): the reported 47/57 win rate and aggregate improvement lack error bars across random seeds, confidence intervals, or statistical significance tests. Without these, it is impossible to determine whether the observed gains are distinguishable from training stochasticity or hyperparameter sensitivity, weakening the claim that DDQL reliably outperforms Double DQN.

minor comments (2)

[Abstract] The abstract states that DDQL 'further reduc[es] overestimation' but does not define the precise overestimation metric or show the corresponding plot or table reference.
[Figures] Figure captions for learning curves should explicitly state the number of independent runs and whether shaded regions represent standard deviation or standard error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on Deep Double Q-learning. We address each major comment below and describe the revisions we intend to incorporate.

read point-by-point responses

Referee: [Network Architecture and Stabilization] Network Architecture and Stabilization section: the claim that shared layers plus lower replay ratio and longer target updates preserve sufficient estimator independence is load-bearing for the central decoupling argument, yet no direct measurement (e.g., correlation between the two Q-head outputs or gradient alignment statistics) or ablation removing the shared backbone is presented. Shared parameters allow gradients from both heads to update the same features, which risks reintroducing the very correlations Double Q-learning is intended to avoid.

Authors: We agree that direct measurements of estimator independence and an ablation with fully separate backbones would strengthen the central argument. Our design uses shared layers for computational efficiency and feature reuse while relying on separate output heads together with reduced replay ratios and extended target-update intervals to limit correlation; the observed further reduction in overestimation provides indirect support. Nevertheless, the absence of explicit correlation statistics or a no-shared-backbone ablation is a limitation. We will add both an analysis of Q-head output correlations and gradient alignment as well as the requested ablation study in the revised manuscript. revision: yes
Referee: [Empirical Results] Empirical Results section (Atari evaluation): the reported 47/57 win rate and aggregate improvement lack error bars across random seeds, confidence intervals, or statistical significance tests. Without these, it is impossible to determine whether the observed gains are distinguishable from training stochasticity or hyperparameter sensitivity, weakening the claim that DDQL reliably outperforms Double DQN.

Authors: We concur that reporting variability across random seeds and formal statistical comparisons would make the empirical claims more robust. The 47/57 win rate and aggregate scores were obtained from single runs per game, consistent with standard large-scale Atari reporting, yet this practice does leave the results vulnerable to seed-specific effects. We will rerun the full evaluation suite with multiple independent seeds, include error bars and confidence intervals, and add statistical significance tests between DDQL and Double DQN in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical algorithm proposal and benchmarking

full rationale

The paper introduces DDQL by adapting the classical Double Q-learning algorithm (which decouples selection and evaluation via two independent Q-functions) to deep networks, then stabilizes training with replay ratio, target update frequency, and shared layers before reporting Atari 2600 results. No derivation, equation, or 'prediction' is shown to reduce to a fitted parameter or self-citation by construction. All performance claims rest on external benchmark comparisons (57 games) rather than internal self-reference. The skeptic concern about shared layers reintroducing correlations is an assumption-validity issue, not a circularity reduction. This is a standard empirical RL paper with independent external validation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the approach inherits standard RL assumptions about value estimation and adds design choices for stability. Limited information prevents exhaustive listing of all free parameters or axioms.

free parameters (2)

replay ratio
Lower replay ratios chosen to stabilize training of two Q-functions.
target network update interval
Longer intervals used as a stabilization technique.

axioms (1)

domain assumption Explicitly training two independent action-value functions decouples selection and evaluation sufficiently to reduce overestimation in deep RL.
Core premise drawn from classical Double Q-learning and applied here.

pith-pipeline@v0.9.0 · 5705 in / 1202 out tokens · 45813 ms · 2026-05-22T00:19:19.011491+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

DDQL maintains two Q-network parameters θ1 and θ2... uses one Q-function to select... and the other to evaluate... LDDQL = L1 + L2
IndisputableMonolith/Foundation/ArithmeticFromLogic LogicNat_induction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

lower replay ratios, longer target network update intervals, and shared layers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 3 internal anchors

[1]

Agarwal, D

R. Agarwal, D. Schuurmans, and M. Norouzi. An Optimistic Perspective on Offline Reinforcement Learning . In International Conference on Machine Learning, 2020

work page 2020
[2]

Agarwal, M

R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. G. Bellemare. Deep Reinforcement Learning at the Edge of the Statistical Precipice . Neural Information Processing Systems, 2021

work page 2021
[3]

Aitchison, P

M. Aitchison, P. Sweetser, and M. Hutter. Atari-5: Distilling the Arcade Learning Environment down to Five Games . In International Conference on Machine Learning, pages 421--438, 2023

work page 2023
[4]

Anschel, N

O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning . In International Conference on Machine Learning, 2017

work page 2017
[5]

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents . Journal of Artificial Intelligence Research, 2013

work page 2013
[6]

M. G. Bellemare, W. Dabney, and R. Munos. A Distributional Perspective on Reinforcement Learning . In International Conference on Machine Learning, 2017

work page 2017
[7]

G. Chen. Decorrelated Double Q-learning . arXiv preprint arXiv:2006.06956, 2020

work page arXiv 2006
[8]

X. Chen, C. Wang, Z. Zhou, and K. Ross. Randomized Ensembled Double Q-Learning: Learning Fast Without a Model . In International Conference on Learning Representations, 2021

work page 2021
[9]

Farebrother, J

J. Farebrother, J. Orbay, Q. Vuong, A. Ali Taiga, Y. Chebotar, T. Xiao, A. Irpan, S. Levine, P. S. Castro, A. Faust, A. Kumar, and R. Agarwal. Stop Regressing: Training Value Functions via Classification for Scalable Deep RL . In International Conference on Machine Learning, 2024

work page 2024
[10]

Fedus, P

W. Fedus, P. Ramachandran, R. Agarwal, Y. Bengio, H. Larochelle, M. Rowland, and W. Dabney. Revisiting Fundamentals of Experience Replay . In International Conference on Machine Learning, 2020

work page 2020
[11]

Fujimoto, H

S. Fujimoto, H. van Hoof, and D. Meger. Addressing Function Approximation Error in Actor-Critic Methods . In International Conference on Machine Learning, pages 1587--1596, 2018

work page 2018
[12]

Fujita, P

Y. Fujita, P. Nagarajan, T. Kataoka, and T. Ishikawa. ChainerRL: A Deep Reinforcement Learning Library . Journal of Machine Learning Research, 2021

work page 2021
[13]

Soft Actor-Critic Algorithms and Applications

T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. Soft Actor-Critic Algorithms and Applications . arXiv preprint arXiv:1812.05905, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Hessel, J

M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining Improvements in Deep Reinforcement Learning . In AAAI Conference on Artificial Intelligence, 2018

work page 2018
[15]

G. H. John. When the Best Move Isn’t Optimal: Q-learning with Exploration . In AAAI Conference on Artificial Intelligence, 1994

work page 1994
[16]

Q. Lan, Y. Pan, A. Fyshe, and M. White. Maxmin Q-learning: Controlling the Estimation Bias of Q-learning . In International Conference on Learning Representations, 2020

work page 2020
[17]

LeCun, Y

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015

work page 2015
[18]

L.-J. Lin. Reinforcement Learning and Teaching . In AAAI Conference on Artificial Intelligence, 1991

work page 1991
[19]

L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. Carnegie Mellon University, 1992

work page 1992
[20]

M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents . Journal of Artificial Intelligence Research, 2018

work page 2018
[21]

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning . Nature, 2015

work page 2015
[22]

J. S. Obando-Ceron and P. S. Castro. Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research . In International Conference on Machine Learning, 2021

work page 2021
[23]

Osband, C

I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep Exploration via Bootstrapped DQN . Neural Information Processing Systems, 2016

work page 2016
[24]

Ostrovski, P

G. Ostrovski, P. S. Castro, and W. Dabney. The Difficulty of Passive Learning in Deep Reinforcement Learning . Neural Information Processing Systems, 2021

work page 2021
[25]

Patterson, S

A. Patterson, S. Neumann, M. White, and A. White. Empirical Design in Reinforcement Learning . Journal of Machine Learning Research, 2024

work page 2024
[26]

O. Peer, C. Tessler, N. Merlis, and R. Meir. Ensemble Bootstrapping for Q-Learning . In International Conference on Machine Learning, 2021

work page 2021
[27]

Quan and G

J. Quan and G. Ostrovski. DQN Zoo : Reference implementations of DQN -based agents, 2020. URL http://github.com/deepmind/dqn_zoo

work page 2020
[28]

Schaul, J

T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized Experience Replay . In International Conference on Learning Representations, 2016

work page 2016
[29]

Schaul, A

T. Schaul, A. Barreto, J. Quan, and G. Ostrovski. The Phenomenon of Policy Churn . Neural Information Processing Systems, 2022

work page 2022
[30]

J. E. Smith and R. L. Winkler. The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis . Management Science, 2006

work page 2006
[31]

R. S. Sutton. Learning to Predict by the Methods of Temporal Differences . Machine learning, 1988

work page 1988
[32]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction . MIT Press, 2018

work page 2018
[33]

Thrun and A

S. Thrun and A. Schwartz. Issues in Using Function Approximation for Reinforcement Learning . In Connectionist Models Summer School, 1993

work page 1993
[34]

Tieleman

T. Tieleman. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 0 (2): 0 26, 2012

work page 2012
[35]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goul \ a o, A. Kallinteris, M. Krimmel, A. KG, et al. Gymnasium: A Standard Interface for Reinforcement Learning Environments . arXiv preprint arXiv:2407.17032, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

van Hasselt

H. van Hasselt. Double Q-learning . Neural Information Processing Systems, 2010

work page 2010
[37]

van Hasselt, A

H. van Hasselt, A. Guez, and D. Silver. Deep Reinforcement Learning with Double Q-learning . In AAAI Conference on Artificial Intelligence, 2016

work page 2016
[38]

Deep Reinforcement Learning and the Deadly Triad

H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil. Deep Reinforcement Learning and the Deadly Triad . arXiv preprint arXiv:1812.02648, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[39]

van Hasselt, M

H. van Hasselt, M. Hessel, and J. Aslanides. When to use parametric models in reinforcement learning? In Neural Information Processing Systems, 2019

work page 2019
[40]

Wagenbach and M

J. Wagenbach and M. Sabatelli. Factors of Influence of the Overestimation Bias of Q-Learning . arXiv preprint arXiv:2210.05262, 2022

work page arXiv 2022
[41]

Waltz and O

M. Waltz and O. Okhrin. Addressing maximization bias in reinforcement learning with two-sample testing. Artificial Intelligence, 2024

work page 2024
[42]

Wang and A

X. Wang and A. Vinel. Cross Learning in Deep Q-Networks . arXiv preprint arXiv:2009.13780, 2020

work page arXiv 2009
[43]

Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas. Dueling Network Architectures for Deep Reinforcement Learning . In International Conference on Machine Learning, 2016

work page 2016
[44]

C. J. Watkins. Learning from Delayed Rewards . PhD thesis , University of Cambridge, Cambridge, UK, 1989

work page 1989
[45]

C. J. Watkins and P. Dayan. Q-learning. Machine learning, 1992

work page 1992
[46]

Zhu and M

R. Zhu and M. Rigotti. Self-correcting Q-learning . In AAAI Conference on Artificial Intelligence, 2021

work page 2021
[47]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

work page

[1] [1]

Agarwal, D

R. Agarwal, D. Schuurmans, and M. Norouzi. An Optimistic Perspective on Offline Reinforcement Learning . In International Conference on Machine Learning, 2020

work page 2020

[2] [2]

Agarwal, M

R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. G. Bellemare. Deep Reinforcement Learning at the Edge of the Statistical Precipice . Neural Information Processing Systems, 2021

work page 2021

[3] [3]

Aitchison, P

M. Aitchison, P. Sweetser, and M. Hutter. Atari-5: Distilling the Arcade Learning Environment down to Five Games . In International Conference on Machine Learning, pages 421--438, 2023

work page 2023

[4] [4]

Anschel, N

O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning . In International Conference on Machine Learning, 2017

work page 2017

[5] [5]

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents . Journal of Artificial Intelligence Research, 2013

work page 2013

[6] [6]

M. G. Bellemare, W. Dabney, and R. Munos. A Distributional Perspective on Reinforcement Learning . In International Conference on Machine Learning, 2017

work page 2017

[7] [7]

G. Chen. Decorrelated Double Q-learning . arXiv preprint arXiv:2006.06956, 2020

work page arXiv 2006

[8] [8]

X. Chen, C. Wang, Z. Zhou, and K. Ross. Randomized Ensembled Double Q-Learning: Learning Fast Without a Model . In International Conference on Learning Representations, 2021

work page 2021

[9] [9]

Farebrother, J

J. Farebrother, J. Orbay, Q. Vuong, A. Ali Taiga, Y. Chebotar, T. Xiao, A. Irpan, S. Levine, P. S. Castro, A. Faust, A. Kumar, and R. Agarwal. Stop Regressing: Training Value Functions via Classification for Scalable Deep RL . In International Conference on Machine Learning, 2024

work page 2024

[10] [10]

Fedus, P

W. Fedus, P. Ramachandran, R. Agarwal, Y. Bengio, H. Larochelle, M. Rowland, and W. Dabney. Revisiting Fundamentals of Experience Replay . In International Conference on Machine Learning, 2020

work page 2020

[11] [11]

Fujimoto, H

S. Fujimoto, H. van Hoof, and D. Meger. Addressing Function Approximation Error in Actor-Critic Methods . In International Conference on Machine Learning, pages 1587--1596, 2018

work page 2018

[12] [12]

Fujita, P

Y. Fujita, P. Nagarajan, T. Kataoka, and T. Ishikawa. ChainerRL: A Deep Reinforcement Learning Library . Journal of Machine Learning Research, 2021

work page 2021

[13] [13]

Soft Actor-Critic Algorithms and Applications

T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. Soft Actor-Critic Algorithms and Applications . arXiv preprint arXiv:1812.05905, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Hessel, J

M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining Improvements in Deep Reinforcement Learning . In AAAI Conference on Artificial Intelligence, 2018

work page 2018

[15] [15]

G. H. John. When the Best Move Isn’t Optimal: Q-learning with Exploration . In AAAI Conference on Artificial Intelligence, 1994

work page 1994

[16] [16]

Q. Lan, Y. Pan, A. Fyshe, and M. White. Maxmin Q-learning: Controlling the Estimation Bias of Q-learning . In International Conference on Learning Representations, 2020

work page 2020

[17] [17]

LeCun, Y

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015

work page 2015

[18] [18]

L.-J. Lin. Reinforcement Learning and Teaching . In AAAI Conference on Artificial Intelligence, 1991

work page 1991

[19] [19]

L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. Carnegie Mellon University, 1992

work page 1992

[20] [20]

M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents . Journal of Artificial Intelligence Research, 2018

work page 2018

[21] [21]

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning . Nature, 2015

work page 2015

[22] [22]

J. S. Obando-Ceron and P. S. Castro. Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research . In International Conference on Machine Learning, 2021

work page 2021

[23] [23]

Osband, C

I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep Exploration via Bootstrapped DQN . Neural Information Processing Systems, 2016

work page 2016

[24] [24]

Ostrovski, P

G. Ostrovski, P. S. Castro, and W. Dabney. The Difficulty of Passive Learning in Deep Reinforcement Learning . Neural Information Processing Systems, 2021

work page 2021

[25] [25]

Patterson, S

A. Patterson, S. Neumann, M. White, and A. White. Empirical Design in Reinforcement Learning . Journal of Machine Learning Research, 2024

work page 2024

[26] [26]

O. Peer, C. Tessler, N. Merlis, and R. Meir. Ensemble Bootstrapping for Q-Learning . In International Conference on Machine Learning, 2021

work page 2021

[27] [27]

Quan and G

J. Quan and G. Ostrovski. DQN Zoo : Reference implementations of DQN -based agents, 2020. URL http://github.com/deepmind/dqn_zoo

work page 2020

[28] [28]

Schaul, J

T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized Experience Replay . In International Conference on Learning Representations, 2016

work page 2016

[29] [29]

Schaul, A

T. Schaul, A. Barreto, J. Quan, and G. Ostrovski. The Phenomenon of Policy Churn . Neural Information Processing Systems, 2022

work page 2022

[30] [30]

J. E. Smith and R. L. Winkler. The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis . Management Science, 2006

work page 2006

[31] [31]

R. S. Sutton. Learning to Predict by the Methods of Temporal Differences . Machine learning, 1988

work page 1988

[32] [32]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction . MIT Press, 2018

work page 2018

[33] [33]

Thrun and A

S. Thrun and A. Schwartz. Issues in Using Function Approximation for Reinforcement Learning . In Connectionist Models Summer School, 1993

work page 1993

[34] [34]

Tieleman

T. Tieleman. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 0 (2): 0 26, 2012

work page 2012

[35] [35]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goul \ a o, A. Kallinteris, M. Krimmel, A. KG, et al. Gymnasium: A Standard Interface for Reinforcement Learning Environments . arXiv preprint arXiv:2407.17032, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

van Hasselt

H. van Hasselt. Double Q-learning . Neural Information Processing Systems, 2010

work page 2010

[37] [37]

van Hasselt, A

H. van Hasselt, A. Guez, and D. Silver. Deep Reinforcement Learning with Double Q-learning . In AAAI Conference on Artificial Intelligence, 2016

work page 2016

[38] [38]

Deep Reinforcement Learning and the Deadly Triad

H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil. Deep Reinforcement Learning and the Deadly Triad . arXiv preprint arXiv:1812.02648, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[39] [39]

van Hasselt, M

H. van Hasselt, M. Hessel, and J. Aslanides. When to use parametric models in reinforcement learning? In Neural Information Processing Systems, 2019

work page 2019

[40] [40]

Wagenbach and M

J. Wagenbach and M. Sabatelli. Factors of Influence of the Overestimation Bias of Q-Learning . arXiv preprint arXiv:2210.05262, 2022

work page arXiv 2022

[41] [41]

Waltz and O

M. Waltz and O. Okhrin. Addressing maximization bias in reinforcement learning with two-sample testing. Artificial Intelligence, 2024

work page 2024

[42] [42]

Wang and A

X. Wang and A. Vinel. Cross Learning in Deep Q-Networks . arXiv preprint arXiv:2009.13780, 2020

work page arXiv 2009

[43] [43]

Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas. Dueling Network Architectures for Deep Reinforcement Learning . In International Conference on Machine Learning, 2016

work page 2016

[44] [44]

C. J. Watkins. Learning from Delayed Rewards . PhD thesis , University of Cambridge, Cambridge, UK, 1989

work page 1989

[45] [45]

C. J. Watkins and P. Dayan. Q-learning. Machine learning, 1992

work page 1992

[46] [46]

Zhu and M

R. Zhu and M. Rigotti. Self-correcting Q-learning . In AAAI Conference on Artificial Intelligence, 2021

work page 2021

[47] [47]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

work page