Benchmarking Batch Deep Reinforcement Learning Algorithms
Pith reviewed 2026-05-17 19:24 UTC · model grok-4.3
The pith
Many batch deep RL algorithms underperform online DQN and the behavioral policy itself when trained on fixed Atari data from one policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under unified batch settings on Atari with data from a single partially-trained policy, most existing algorithms underperform both online-trained DQN using the same data volume and the behavioral policy itself; however, an adaptation of Batch-Constrained Q-learning to discrete actions outperforms all compared algorithms.
What carries the argument
The Batch-Constrained Q-learning algorithm adapted to discrete action spaces, which restricts action selection to those likely under the behavioral policy in order to reduce extrapolation error in the Q-function.
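The restriction described above can be sketched as a masked greedy action choice. This is a minimal illustration, not the authors' implementation: the function name, the relative-probability threshold `tau`, and its default value are assumptions introduced here, and the behavioral-policy probabilities are assumed to come from some separately trained estimator.

```python
import numpy as np

def constrained_greedy_action(q_values, behavior_probs, tau=0.3):
    """Pick the highest-Q action among actions that look likely under the
    (estimated) behavioral policy. `tau` is a hypothetical threshold."""
    q_values = np.asarray(q_values, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    # Likelihood of each action relative to the most likely action.
    relative = behavior_probs / behavior_probs.max()
    # Keep only actions whose relative likelihood exceeds the threshold.
    allowed = relative > tau
    # Disallowed actions get -inf so argmax can never select them.
    masked_q = np.where(allowed, q_values, -np.inf)
    return int(np.argmax(masked_q))

# Action 1 has the highest Q-value but is rare in the batch, so it is excluded.
action = constrained_greedy_action([1.0, 5.0, 2.0], [0.6, 0.05, 0.35])
```

With `tau` set to 0 every action with nonzero probability is allowed and the rule reduces to ordinary greedy Q-learning; as `tau` approaches 1 it approaches imitation of the most likely behavioral action.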
Load-bearing premise
That data generated by a single partially-trained behavioral policy under unified settings produces a representative and fair testbed for comparing batch RL algorithms.
What would settle it
Re-running the full suite of algorithms on datasets collected from a fully trained expert policy or from several different policies and checking whether the performance ordering, especially the superiority of adapted BCQ, remains the same.
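One way to make this check concrete is to compare the per-algorithm ranking across two data regimes. The sketch below is illustrative only: the function names and the score dictionaries are hypothetical placeholders, and no numbers from the paper are used.

```python
def ranking(scores):
    """Return algorithm names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

def ordering_preserved(scores_a, scores_b):
    """True if both regimes rank every algorithm identically."""
    return ranking(scores_a) == ranking(scores_b)

def top_method_preserved(scores_a, scores_b):
    """True if the best-scoring algorithm is the same in both regimes."""
    return ranking(scores_a)[0] == ranking(scores_b)[0]

# Placeholder scores, NOT results from the paper.
single_policy = {"algo_a": 3.0, "algo_b": 1.0, "algo_c": 2.0}
expert_policy = {"algo_a": 9.0, "algo_b": 5.0, "algo_c": 4.0}

same_order = ordering_preserved(single_policy, expert_policy)
same_winner = top_method_preserved(single_policy, expert_policy)
```

Separating the two checks matters because the claim about adapted BCQ only needs the top position to survive, while the broader benchmarking conclusion depends on the full ordering.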
Original abstract
Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting--learning from a fixed data set without interaction with the environment. Following this result, there have been several papers showing reasonable performances under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks recent off-policy and batch deep RL algorithms on Atari games using a fixed dataset generated by a single partially-trained DQN behavioral policy. Under this unified setup, it reports that many algorithms underperform both an online DQN trained on equivalent data volume and the behavioral policy itself. The authors adapt Batch-Constrained Q-learning (BCQ) to the discrete-action case and show that the adapted version outperforms the other methods evaluated.
Significance. If the empirical ordering holds under the stated conditions, the work is significant for highlighting limitations of existing batch RL approaches on partially-trained policy data and for supplying a competitive discrete-action baseline. The standardized Atari testbed and explicit scoping to a single-policy batch are strengths that aid reproducibility and future comparisons. The stress-test concern about representativeness does not undermine the central claim because the manuscript consistently qualifies results as holding 'under these conditions' rather than asserting universality across all batch distributions.
major comments (2)
- [§4] §4 (Experimental Results), Table 2: the claim that adapted BCQ 'outperforms all existing algorithms' is presented without reported standard errors, number of random seeds, or statistical tests; this weakens the load-bearing performance ordering because small differences could be due to run-to-run variance.
- [§3.2] §3.2 (Data Generation): the behavioral policy is trained for a fixed but unspecified duration before data collection; because this controls the degree of off-policyness and state-action coverage, the relative advantage of constraint-based methods such as adapted BCQ may be sensitive to this choice, yet no sensitivity analysis is provided.
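The first major comment turns on whether score gaps exceed run-to-run variance. A minimal sketch of that check, assuming per-seed final scores are available as plain lists, is given here. The helper names and the multiplier `k` are hypothetical, and the comparison is a rough heuristic rather than a formal statistical test.

```python
import math
import statistics

def mean_and_standard_error(scores):
    """Mean and standard error of the mean over independent seeds."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem

def margin_exceeds_noise(scores_a, scores_b, k=2.0):
    """Heuristic: is mean(a) - mean(b) larger than k combined standard errors?"""
    mean_a, sem_a = mean_and_standard_error(scores_a)
    mean_b, sem_b = mean_and_standard_error(scores_b)
    combined = math.sqrt(sem_a ** 2 + sem_b ** 2)
    return (mean_a - mean_b) > k * combined

# Placeholder per-seed scores, NOT results from the paper.
mean, sem = mean_and_standard_error([10.0, 12.0, 14.0])
```

Reporting the seed count alongside each mean and standard error lets readers repeat this comparison for any pair of algorithms in the table.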
minor comments (3)
- [Abstract] The abstract and §1 should explicitly state the number of games and total transitions used so readers can immediately gauge scale.
- [Figures] Figure captions (e.g., Figure 3) would benefit from clarifying whether scores are normalized per-game or aggregated, and whether the behavioral policy line is the same across panels.
- [§5] A short paragraph in §5 discussing how the single-policy batch differs from expert or multi-policy batches would help readers interpret the scope without altering the central claim.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation of minor revision. We have carefully considered the two major comments and revised the manuscript to address them directly while preserving the scope and claims of the original work.
Point-by-point responses
-
Referee: §4 (Experimental Results), Table 2: the claim that adapted BCQ 'outperforms all existing algorithms' is presented without reported standard errors, number of random seeds, or statistical tests; this weakens the load-bearing performance ordering because small differences could be due to run-to-run variance.
Authors: We agree that explicitly reporting the number of random seeds and standard errors improves the robustness of the performance comparison. In the revised manuscript we have updated Table 2 to include standard errors computed across the independent runs performed for each algorithm. The performance margins for adapted BCQ are substantially larger than these standard errors on the majority of games, which supports the reported ordering. We have also added a short paragraph in Section 4 noting the observed run-to-run variability and explaining why formal statistical tests were not included: the primary goal of the study is to demonstrate broad underperformance of existing methods relative to both the behavioral policy and online DQN rather than to establish pairwise statistical significance. revision: yes
-
Referee: §3.2 (Data Generation): the behavioral policy is trained for a fixed but unspecified duration before data collection; because this controls the degree of off-policyness and state-action coverage, the relative advantage of constraint-based methods such as adapted BCQ may be sensitive to this choice, yet no sensitivity analysis is provided.
Authors: We thank the referee for highlighting the need for greater clarity on this experimental detail. The revised manuscript now explicitly states the behavioral policy training duration (10 million frames) in Section 3.2 and explains the rationale for selecting a partially-trained policy. A comprehensive sensitivity analysis across multiple training horizons would require a substantial additional experimental budget and lies outside the intended scope of this benchmarking paper, which focuses on a single, reproducible batch-generation protocol. We have added a brief discussion acknowledging that results may vary with different degrees of off-policyness and identifying this as an interesting direction for follow-up work. revision: partial
Circularity Check
No significant circularity: empirical benchmarking with direct experimental comparisons
full rationale
This is a purely empirical benchmarking study that runs existing algorithms (including an adaptation of BCQ) on fixed Atari batches generated by one partially-trained policy and reports performance numbers. No derivations, equations, or fitted parameters are presented as predictions that reduce to the inputs by construction. No self-citations are used as load-bearing justification for the ordering of results. The central claims rest on direct experimental outcomes under the stated unified settings rather than any internal reuse or renaming of prior quantities.
Axiom & Free-Parameter Ledger
free parameters (1)
- Behavioral policy training duration
axioms (1)
- Domain assumption: Atari environments can be treated as finite-horizon MDPs with discrete actions and deterministic transitions given the action
Forward citations
Cited by 17 Pith papers
-
Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning
ALaM stabilizes state-wise multiplier networks in safe RL via quadratic penalties and supervised regression on dual targets, guaranteeing multiplier convergence and optimal constrained policies when combined with SAC.
-
Improving Feasibility via Fast Autoencoder-Based Projections
An adversarially trained autoencoder learns a convex latent space to enable rapid approximate projections that enforce nonconvex constraints in optimization and reinforcement learning.
-
Fatigue-Aware Learning to Defer via Constrained Optimisation
FALCON incorporates psychologically grounded fatigue curves into learning-to-defer via a CMDP formulation and PPO-Lagrangian optimization, outperforming prior L2D methods and generalizing to unseen fatigue patterns on...
-
Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability
Action-conditioned near-term risk prediction gates optimistic and conservative value estimates in RL to approximate risk-sensitive POMDP control, yielding better safety-performance tradeoffs with lower runtime than be...
-
Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation
CVaR-constrained TD3 policies for robot navigation show larger safety margins and higher post-training reachability verification rates than average-cost baselines across simulated scenarios and real-robot tests.
-
Optimal design of solar-battery hybrid resources considering multi-market participation under weather and price uncertainty
A deep reinforcement learning co-optimization framework is developed for jointly sizing solar-battery hybrids and determining their multi-market bidding strategies under stochastic weather and price conditions.
-
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
Introduces RAPCs and a contraction Bellman operator that jointly enforce probabilistic reach-avoid constraints while minimizing expected costs in stochastic RL, with almost-sure convergence to local optima.
-
Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.
-
Shaping Zero-Shot Coordination via State Blocking
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
-
Why Does Agentic Safety Fail to Generalize Across Tasks?
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
-
Learned Lyapunov Shielding for Adaptive Control
Learned Lyapunov functions, residual SAC policies, and PINNs are combined with a Slotine-Li controller and a closed-form safety filter to improve tracking on uncertain Euler-Lagrange systems while retaining stability ...
-
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
-
CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.
-
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.
-
Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics
SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.
-
CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning
CAPSULE learns probabilistic control-affine dynamics offline to construct uncertainty-incorporating control barrier functions that enforce conservative safety constraints via online action correction in reinforcement ...
-
Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.
Reference graph
Works this paper leans on
-
[1]
Striving for simplicity in off-policy deep reinforcement learning
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543,
-
[2]
Exploration by Random Network Distillation
Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894,
-
[3]
Dopamine: A Research Framework for Deep Reinforcement Learning
Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110,
-
[4]
Distributional Reinforcement Learning with Quantile Regression
Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044,
-
[5]
Off-policy deep reinforcement learning without exploration
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062,
-
[6]
Horizon: Facebook’s open source applied reinforcement learning platform
Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye, Zhengxing Chen, and Scott Fujimoto. Horizon: Facebook’s open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260,
-
[7]
Stable function approximation in dynamic programming
Geoffrey J Gordon. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pages 261–268. Elsevier,
-
[8]
Rainbow: Combining Improvements in Deep Reinforcement Learning
Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298,
-
[9]
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456,
-
[10]
Adam: A Method for Stochastic Optimization
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
-
[11]
Stabilizing off-policy q-learning via bootstrapping error reduction
Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949,
-
[12]
Continuous control with deep reinforcement learning
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,
-
[13]
Safe policy improvement with an estimated baseline policy
Thiago D Simão, Romain Laroche, and Rémi Tachet des Combes. Safe policy improvement with an estimated baseline policy. arXiv preprint arXiv:1909.05236,
-
[14]
Dropout: a simple way to prevent neural networks from overfitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958,
-
[15]
Issues in using function approximation for reinforcement learning
Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum,
-
[16]
Deep reinforcement learning with double q-learning
Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, pages 2094–2100,
-
[17]
A Experimental Details
A.1 Atari Preprocessing. The Atari 2600 environment is preprocessed in the same manner as previous work [Mnih et al., 2015, Machado et al., 2018, Castro et al., 2018] and we use consistent preprocessing across all tasks and algorithms. We denote the output of the Atari environment as frames. These frames are grayscaled and resize...
-
[18]
Table 1: Hyper-parameters used by each network
Hyper-parameters were chosen to match the implementation of Rainbow [Hessel et al., 2017] in the Dopamine framework [Castro et al., 2018].
Table 1: Hyper-parameters used by each network.
Hyper-parameter: Value
Network optimizer: Adam [Kingma and Ba, 2014]
Learning rate: 0.0000625
Adam ϵ: 0.00015
Discount γ: 0.99
Mini-batch size: 32
Target network update frequen...