arxiv: 1910.01708 · v1 · pith:TTCAXWKNnew · submitted 2019-10-03 · 💻 cs.LG · cs.AI · stat.ML

Benchmarking Batch Deep Reinforcement Learning Algorithms

Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, Joelle Pineau

Pith reviewed 2026-05-17 19:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords batch reinforcement learning · deep Q-learning · Atari · off-policy learning · BCQ · fixed dataset · benchmark

The pith

Many batch deep RL algorithms underperform online DQN and the behavioral policy itself when trained on fixed Atari data from one policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests recent off-policy and batch reinforcement learning algorithms on Atari games using one fixed dataset collected by a single partially trained behavioral policy. It reports that under these consistent conditions most algorithms fall short of both a DQN trained online with the same quantity of data and the original behavioral policy. The authors adapt Batch-Constrained Q-learning to discrete actions and show that the resulting method beats every other algorithm examined. The work therefore supplies both a cautionary benchmark and a stronger baseline for learning from static experience without further environment interaction.
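
To make the batch setting concrete, here is a minimal tabular sketch, assuming a toy dataset of `(state, action, reward, next_state, done)` tuples. The paper itself works with deep networks on Atari; the function name `batch_q_learning`, its parameters, and the two-transition dataset are hypothetical and chosen only for illustration. The point is that the learner replays a fixed list of transitions and never queries an environment.

```python
import random
from collections import defaultdict


def batch_q_learning(transitions, n_actions, gamma=0.99, lr=0.1, epochs=200, seed=0):
    """Tabular Q-learning over a fixed batch of transitions (illustrative only)."""
    rng = random.Random(seed)
    # Unseen (state, action) pairs keep their initial value of 0.0.
    q = defaultdict(lambda: [0.0] * n_actions)
    data = list(transitions)
    for _ in range(epochs):
        rng.shuffle(data)  # replay the same fixed data in a new order
        for state, action, reward, next_state, done in data:
            # The max ranges over ALL actions, including ones the batch never
            # contains for next_state; with function approximation this is
            # where extrapolation error can enter.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += lr * (target - q[state][action])
    return q


# Hypothetical two-step chain: s0 --a0--> s1 --a1--> terminal, reward 1 at the end.
toy_batch = [
    ("s0", 0, 0.0, "s1", False),
    ("s1", 1, 1.0, "end", True),
]
q_table = batch_q_learning(toy_batch, n_actions=2)
print(q_table["s0"], q_table["s1"])
```

In this toy case the values of the actions that appear in the batch converge, while the values of actions absent from the batch are never updated, which is the gap that batch-specific methods try to manage.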

Core claim

Under unified batch settings on Atari with data from a single partially-trained policy, most existing algorithms underperform both online-trained DQN using the same data volume and the behavioral policy itself; however, an adaptation of Batch-Constrained Q-learning to discrete actions outperforms all compared algorithms.

What carries the argument

The Batch-Constrained Q-learning algorithm adapted to discrete action spaces, which restricts action selection to those likely under the behavioral policy in order to reduce extrapolation error in the Q-function.
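
The restriction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a learned estimate of the behavioral policy's action probabilities is available, and it treats an action as "likely" when its probability relative to the most probable action exceeds a threshold. The function name `constrained_greedy_action`, the relative-probability rule, and the default threshold of 0.3 are all hypothetical choices made for this example.

```python
from typing import Sequence


def constrained_greedy_action(
    q_values: Sequence[float],
    behavior_probs: Sequence[float],
    threshold: float = 0.3,  # hypothetical default, not taken from the text above
) -> int:
    """Return the highest-Q action among actions deemed likely under the behavior model."""
    if len(q_values) != len(behavior_probs):
        raise ValueError("q_values and behavior_probs must have the same length")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must lie in [0, 1)")
    max_prob = max(behavior_probs)
    if max_prob <= 0.0:
        raise ValueError("behavior_probs must contain a positive entry")
    # Keep actions whose probability, relative to the most likely action,
    # exceeds the threshold. The most likely action always qualifies.
    allowed = [a for a, p in enumerate(behavior_probs) if p / max_prob > threshold]
    return max(allowed, key=lambda a: q_values[a])


q = [1.0, 5.0, 2.0]       # action 1 has the largest Q estimate...
probs = [0.6, 0.05, 0.35]  # ...but is rarely taken by the behavior policy
print(constrained_greedy_action(q, probs))                 # 2
print(constrained_greedy_action(q, probs, threshold=0.0))  # 1 (no effective constraint)
```

With the threshold at 0 the rule reduces to ordinary greedy selection; as the threshold approaches 1 it approaches imitation of the most likely behavioral action.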

Load-bearing premise

That data generated by a single partially-trained behavioral policy under unified settings produces a representative and fair testbed for comparing batch RL algorithms.

What would settle it

Re-running the full suite of algorithms on datasets collected from a fully trained expert policy or from several different policies and checking whether the performance ordering, especially the superiority of adapted BCQ, remains the same.

Original abstract

Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting--learning from a fixed data set without interaction with the environment. Following this result, there have been several papers showing reasonable performances under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper benchmarks recent off-policy and batch deep RL algorithms on Atari games using a fixed dataset generated by a single partially-trained DQN behavioral policy. Under this unified setup, it reports that many algorithms underperform both an online DQN trained on equivalent data volume and the behavioral policy itself. The authors adapt Batch-Constrained Q-learning (BCQ) to the discrete-action case and show that the adapted version outperforms the other methods evaluated.

Significance. If the empirical ordering holds under the stated conditions, the work is significant for highlighting limitations of existing batch RL approaches on partially-trained policy data and for supplying a competitive discrete-action baseline. The standardized Atari testbed and explicit scoping to a single-policy batch are strengths that aid reproducibility and future comparisons. The stress-test concern about representativeness does not undermine the central claim because the manuscript consistently qualifies results as holding 'under these conditions' rather than asserting universality across all batch distributions.

major comments (2)

[§4] §4 (Experimental Results), Table 2: the claim that adapted BCQ 'outperforms all existing algorithms' is presented without reported standard errors, number of random seeds, or statistical tests; this weakens the load-bearing performance ordering because small differences could be due to run-to-run variance.
[§3.2] §3.2 (Data Generation): the behavioral policy is trained for a fixed but unspecified duration before data collection; because this controls the degree of off-policyness and state-action coverage, the relative advantage of constraint-based methods such as adapted BCQ may be sensitive to this choice, yet no sensitivity analysis is provided.
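
The first major comment's concern about run-to-run variance can be made concrete with a short sketch. It assumes that each algorithm has a list of final scores, one per random seed, on a given game; the helper names `mean_and_stderr` and `margin_exceeds_noise`, the factor `k`, and the example scores are hypothetical and are not taken from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across seeds (sample std / sqrt(n))."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def margin_exceeds_noise(scores_a: Sequence[float], scores_b: Sequence[float], k: float = 2.0) -> bool:
    """True if mean(a) - mean(b) is larger than k combined standard errors."""
    mean_a, se_a = mean_and_stderr(scores_a)
    mean_b, se_b = mean_and_stderr(scores_b)
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return (mean_a - mean_b) > k * combined


# Made-up per-seed scores for two algorithms on one game.
print(mean_and_stderr([10.0, 12.0, 14.0]))
print(margin_exceeds_noise([10.0, 12.0, 14.0], [5.0, 6.0, 7.0]))     # True
print(margin_exceeds_noise([10.0, 12.0, 14.0], [11.0, 12.0, 13.0]))  # False
```

This is a rough screen rather than a formal statistical test; with very few seeds the standard error itself is poorly estimated.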

minor comments (3)

[Abstract] The abstract and §1 should explicitly state the number of games and total transitions used so readers can immediately gauge scale.
[Figures] Figure captions (e.g., Figure 3) would benefit from clarifying whether scores are normalized per-game or aggregated, and whether the behavioral policy line is the same across panels.
[§5] A short paragraph in §5 discussing how the single-policy batch differs from expert or multi-policy batches would help readers interpret the scope without altering the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation of minor revision. We have carefully considered the two major comments and revised the manuscript to address them directly while preserving the scope and claims of the original work.

Point-by-point responses

Referee: §4 (Experimental Results), Table 2: the claim that adapted BCQ 'outperforms all existing algorithms' is presented without reported standard errors, number of random seeds, or statistical tests; this weakens the load-bearing performance ordering because small differences could be due to run-to-run variance.

Authors: We agree that explicitly reporting the number of random seeds and standard errors improves the robustness of the performance comparison. In the revised manuscript we have updated Table 2 to include standard errors computed across the independent runs performed for each algorithm. The performance margins for adapted BCQ are substantially larger than these standard errors on the majority of games, which supports the reported ordering. We have also added a short paragraph in Section 4 noting the observed run-to-run variability and explaining why formal statistical tests were not included: the primary goal of the study is to demonstrate broad underperformance of existing methods relative to both the behavioral policy and online DQN rather than to establish pairwise statistical significance.
revision: yes

Referee: §3.2 (Data Generation): the behavioral policy is trained for a fixed but unspecified duration before data collection; because this controls the degree of off-policyness and state-action coverage, the relative advantage of constraint-based methods such as adapted BCQ may be sensitive to this choice, yet no sensitivity analysis is provided.

Authors: We thank the referee for highlighting the need for greater clarity on this experimental detail. The revised manuscript now explicitly states the behavioral policy training duration (10 million frames) in Section 3.2 and explains the rationale for selecting a partially-trained policy. A comprehensive sensitivity analysis across multiple training horizons would require a substantial additional experimental budget and lies outside the intended scope of this benchmarking paper, which focuses on a single, reproducible batch-generation protocol. We have added a brief discussion acknowledging that results may vary with different degrees of off-policyness and identifying this as an interesting direction for follow-up work.
revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical benchmarking with direct experimental comparisons

full rationale

This is a purely empirical benchmarking study that runs existing algorithms (including an adaptation of BCQ) on fixed Atari batches generated by one partially-trained policy and reports performance numbers. No derivations, equations, or fitted parameters are presented as predictions that reduce to the inputs by construction. No self-citations are used as load-bearing justification for the ordering of results. The central claims rest on direct experimental outcomes under the stated unified settings rather than any internal reuse or renaming of prior quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

This is an empirical benchmarking study; the central claims rest on standard reinforcement-learning assumptions and experimental comparisons rather than new theoretical derivations.

free parameters (1)

Behavioral policy training duration
The policy is described only as 'partially-trained'; the exact training horizon or performance level is a free choice that defines the batch data distribution.

axioms (1)

domain assumption: Atari environments can be treated as finite-horizon MDPs with discrete actions and deterministic transitions given the action.
Invoked by the choice of Atari benchmark and the use of standard Q-learning style updates.
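
For reference, the "standard Q-learning style updates" mentioned above are usually written as the first target below. This is textbook notation, not a formula quoted from the paper; the symbols $\theta'$ (target-network parameters) and $\mathcal{A}_\tau(s')$ (an assumed shorthand for the set of actions considered likely under the behavioral policy) are introduced here only for illustration.

```latex
\begin{align}
y &= r + \gamma \max_{a'} Q_{\theta'}(s', a')
  && \text{(unconstrained target)} \\
y_{\text{constrained}} &= r + \gamma \max_{a' \in \mathcal{A}_\tau(s')} Q_{\theta'}(s', a')
  && \text{(target restricted to likely actions)}
\end{align}
```

The two targets differ only in the set over which the maximum is taken, which is where a batch-constrained method departs from plain Q-learning.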

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 7.0
ALaM stabilizes state-wise multiplier networks in safe RL via quadratic penalties and supervised regression on dual targets, guaranteeing multiplier convergence and optimal constrained policies when combined with SAC.

Improving Feasibility via Fast Autoencoder-Based Projections
cs.LG · 2026-04 · unverdicted · novelty 7.0
An adversarially trained autoencoder learns a convex latent space to enable rapid approximate projections that enforce nonconvex constraints in optimization and reinforcement learning.

Fatigue-Aware Learning to Defer via Constrained Optimisation
cs.LG · 2026-04 · unverdicted · novelty 7.0
FALCON incorporates psychologically grounded fatigue curves into learning-to-defer via a CMDP formulation and PPO-Lagrangian optimization, outperforming prior L2D methods and generalizing to unseen fatigue patterns on...

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability
cs.LG · 2026-05 · unverdicted · novelty 6.0
Action-conditioned near-term risk prediction gates optimistic and conservative value estimates in RL to approximate risk-sensitive POMDP control, yielding better safety-performance tradeoffs with lower runtime than be...

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation
cs.RO · 2026-05 · unverdicted · novelty 6.0
CVaR-constrained TD3 policies for robot navigation show larger safety margins and higher post-training reachability verification rates than average-cost baselines across simulated scenarios and real-robot tests.

Optimal design of solar-battery hybrid resources considering multi-market participation under weather and price uncertainty
eess.SY · 2026-05 · unverdicted · novelty 6.0
A deep reinforcement learning co-optimization framework is developed for jointly sizing solar-battery hybrids and determining their multi-market bidding strategies under stochastic weather and price conditions.

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
cs.LG · 2026-05 · unverdicted · novelty 6.0
Introduces RAPCs and a contraction Bellman operator that jointly enforce probabilistic reach-avoid constraints while minimizing expected costs in stochastic RL, with almost-sure convergence to local optima.

Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
cs.LG · 2026-05 · unverdicted · novelty 6.0
An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.

Shaping Zero-Shot Coordination via State Blocking
cs.LG · 2026-05 · unverdicted · novelty 6.0
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG · 2026-05 · conditional · novelty 6.0
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

Learned Lyapunov Shielding for Adaptive Control
cs.LG · 2026-05 · unverdicted · novelty 6.0
Learned Lyapunov functions, residual SAC policies, and PINNs are combined with a Slotine-Li controller and a closed-form safety filter to improve tracking on uncertain Euler-Lagrange systems while retaining stability ...

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
cs.RO · 2026-05 · unverdicted · novelty 6.0
Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.

CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
cs.RO · 2026-04 · unverdicted · novelty 6.0
CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.

Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
cs.RO · 2025-10 · unverdicted · novelty 6.0
A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.

Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics
cs.LG · 2026-05 · unverdicted · novelty 5.0
SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.

CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning
cs.LG · 2026-04 · unverdicted · novelty 5.0
CAPSULE learns probabilistic control-affine dynamics offline to construct uncertainty-incorporating control barrier functions that enforce conservative safety constraints via online action correction in reinforcement ...

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
cs.AI · 2026-04 · unverdicted · novelty 5.0
PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 17 Pith papers · 7 internal anchors

[1]

Striving for simplicity in off-policy deep reinforcement learning

Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543,

[2]

Exploration by Random Network Distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894,

[3]

Dopamine: A Research Framework for Deep Reinforcement Learning

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110,

[4]

Distributional Reinforcement Learning with Quantile Regression

Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044,

[5]

Off-policy deep reinforcement learning without exploration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062,

[6]

Horizon: Facebook’s open source applied reinforcement learning platform

Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye, Zhengxing Chen, and Scott Fujimoto. Horizon: Facebook’s open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260,

[7]

Stable function approximation in dynamic programming

Geoffrey J Gordon. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pages 261–268. Elsevier,

[8]

Rainbow: Combining Improvements in Deep Reinforcement Learning

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298,

[9]

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456,

[10]

Adam: A Method for Stochastic Optimization

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

[11]

Stabilizing off-policy q-learning via bootstrapping error reduction

Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949,

[12]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,

[13]

Safe policy improvement with an estimated baseline policy

Thiago D Simão, Romain Laroche, and Rémi Tachet des Combes. Safe policy improvement with an estimated baseline policy. arXiv preprint arXiv:1909.05236,

[14]

Dropout: a simple way to prevent neural networks from overfitting

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958,

[15]

Issues in using function approximation for reinforcement learning

Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum,

[16]

Deep reinforcement learning with double q-learning

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, pages 2094–2100,

[17]

A Experimental Details. A.1 Atari Preprocessing. The Atari 2600 environment is preprocessed in the same manner as previous work [Mnih et al., 2015, Machado et al., 2018, Castro et al., 2018] and we use consistent preprocessing across all tasks and algorithms. We denote the output of the Atari environment as frames. These frames are grayscaled and resize...

[18]

Table 1: Hyper-parameters used by each network

Hyper-parameters were chosen to match the implementation of Rainbow [Hessel et al., 2017] in the Dopamine framework [Castro et al., 2018].

Table 1: Hyper-parameters used by each network.

Hyper-parameter        Value
Network optimizer      Adam [Kingma and Ba, 2014]
Learning rate          0.0000625
Adam ϵ                 0.00015
Discount γ             0.99
Mini-batch size        32
Target network update frequen...
