hub Mixed citations

Benchmarking Batch Deep Reinforcement Learning Algorithms

Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, Joelle Pineau · 2019 · cs.LG · arXiv 1910.01708

Mixed citation behavior. Most common role is background (67%).

21 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 21 citing papers arXiv PDF

abstract

Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting--learning from a fixed data set without interaction with the environment. Following this result, there have been several papers showing reasonable performances under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 dataset 1 method 1

citation-polarity summary

background 4 use dataset 1 use method 1

representative citing papers

Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

ALaM stabilizes state-wise multiplier networks in safe RL via quadratic penalties and supervised regression on dual targets, guaranteeing multiplier convergence and optimal constrained policies when combined with SAC.

Improving Feasibility via Fast Autoencoder-Based Projections

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

An adversarially trained autoencoder learns a convex latent space to enable rapid approximate projections that enforce nonconvex constraints in optimization and reinforcement learning.

Fatigue-Aware Learning to Defer via Constrained Optimisation

cs.LG · 2026-04-01 · unverdicted · novelty 7.0

FALCON incorporates psychologically grounded fatigue curves into learning-to-defer via a CMDP formulation and PPO-Lagrangian optimization, outperforming prior L2D methods and generalizing to unseen fatigue patterns on the new FA-L2D benchmark.

Generative Auto-Bidding with Unified Modeling and Exploration

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

GUIDE integrates a Decision Transformer for joint modeling of bidding actions and states with Q-value regularization for exploration and an IDM for safe policy fallback, outperforming baselines in simulations and real Taobao deployment with gains in GMV, clicks, cost, and ROI.

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Action-conditioned near-term risk prediction gates optimistic and conservative value estimates in RL to approximate risk-sensitive POMDP control, yielding better safety-performance tradeoffs with lower runtime than belief planning baselines.

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

CVaR-constrained TD3 policies for robot navigation show larger safety margins and higher post-training reachability verification rates than average-cost baselines across simulated scenarios and real-robot tests.

Optimal design of solar-battery hybrid resources considering multi-market participation under weather and price uncertainty

eess.SY · 2026-05-13 · unverdicted · novelty 6.0

A deep reinforcement learning co-optimization framework is developed for jointly sizing solar-battery hybrids and determining their multi-market bidding strategies under stochastic weather and price conditions.

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.

Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.

Shaping Zero-Shot Coordination via State Blocking

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

Why Does Agentic Safety Fail to Generalize Across Tasks?

cs.LG · 2026-05-07 · conditional · novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.

Learned Lyapunov Shielding for Adaptive Control

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Learned Lyapunov functions, residual SAC policies, and PINNs are combined with a Slotine-Li controller and a closed-form safety filter to improve tracking on uncertain Euler-Lagrange systems while retaining stability guarantees.

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.

CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection

cs.RO · 2026-04-08 · unverdicted · novelty 6.0

CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.

Constraint-Aware Reinforcement Learning via Adaptive Action Scaling

cs.RO · 2025-10-13 · unverdicted · novelty 6.0

A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.

Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.

CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

cs.LG · 2026-04-26 · unverdicted · novelty 5.0

CAPSULE learns probabilistic control-affine dynamics offline to construct uncertainty-incorporating control barrier functions that enforce conservative safety constraints via online action correction in reinforcement learning.

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

cs.AI · 2026-04-14 · unverdicted · novelty 5.0

PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

cs.AI · 2026-02-13 · unverdicted · novelty 5.0

MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.

Analyzing Adversarial Inputs in Deep Reinforcement Learning

cs.LG · 2024-02-07 · unverdicted · novelty 5.0

Introduces the Adversarial Rate metric and associated tools to systematically evaluate and visualize the impact of adversarial inputs on DRL policies using formal verification.

A Review On Safe Reinforcement Learning Using Lyapunov and Barrier Functions

eess.SY · 2025-08-12 · unverdicted · novelty 2.0

A literature review of safe RL using Lyapunov and barrier functions that identifies a shift to model-free methods since 2017, well-defined open problems per approach class, and high-dimensional scalability as the main barrier.

citing papers explorer

Showing 21 of 21 citing papers.

Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 4 · internal anchor
ALaM stabilizes state-wise multiplier networks in safe RL via quadratic penalties and supervised regression on dual targets, guaranteeing multiplier convergence and optimal constrained policies when combined with SAC.
Improving Feasibility via Fast Autoencoder-Based Projections cs.LG · 2026-04-03 · unverdicted · none · ref 9 · internal anchor
An adversarially trained autoencoder learns a convex latent space to enable rapid approximate projections that enforce nonconvex constraints in optimization and reinforcement learning.
Fatigue-Aware Learning to Defer via Constrained Optimisation cs.LG · 2026-04-01 · unverdicted · none · ref 34 · internal anchor
FALCON incorporates psychologically grounded fatigue curves into learning-to-defer via a CMDP formulation and PPO-Lagrangian optimization, outperforming prior L2D methods and generalizing to unseen fatigue patterns on the new FA-L2D benchmark.
Generative Auto-Bidding with Unified Modeling and Exploration cs.AI · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
GUIDE integrates a Decision Transformer for joint modeling of bidding actions and states with Q-value regularization for exploration and an IDM for safe policy fallback, outperforming baselines in simulations and real Taobao deployment with gains in GMV, clicks, cost, and ROI.
Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability cs.LG · 2026-05-14 · unverdicted · none · ref 21 · internal anchor
Action-conditioned near-term risk prediction gates optimistic and conservative value estimates in RL to approximate risk-sensitive POMDP control, yielding better safety-performance tradeoffs with lower runtime than belief planning baselines.
Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation cs.RO · 2026-05-13 · unverdicted · none · ref 5 · internal anchor
CVaR-constrained TD3 policies for robot navigation show larger safety margins and higher post-training reachability verification rates than average-cost baselines across simulated scenarios and real-robot tests.
Optimal design of solar-battery hybrid resources considering multi-market participation under weather and price uncertainty eess.SY · 2026-05-13 · unverdicted · none · ref 26 · internal anchor
A deep reinforcement learning co-optimization framework is developed for jointly sizing solar-battery hybrids and determining their multi-market bidding strategies under stochastic weather and price conditions.
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 8 · 2 links · internal anchor
Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.
Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs cs.LG · 2026-05-12 · unverdicted · none · ref 13 · internal anchor
An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.
Shaping Zero-Shot Coordination via State Blocking cs.LG · 2026-05-12 · unverdicted · none · ref 40 · internal anchor
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 93 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
Learned Lyapunov Shielding for Adaptive Control cs.LG · 2026-05-07 · unverdicted · none · ref 24 · internal anchor
Learned Lyapunov functions, residual SAC policies, and PINNs are combined with a Slotine-Li controller and a closed-form safety filter to improve tracking on uncertain Euler-Lagrange systems while retaining stability guarantees.
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning cs.RO · 2026-05-06 · unverdicted · none · ref 69 · internal anchor
Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection cs.RO · 2026-04-08 · unverdicted · none · ref 38 · internal anchor
CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling cs.RO · 2025-10-13 · unverdicted · none · ref 16 · internal anchor
A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.
Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics cs.LG · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.
CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning cs.LG · 2026-04-26 · unverdicted · none · ref 9 · internal anchor
CAPSULE learns probabilistic control-affine dynamics offline to construct uncertainty-incorporating control barrier functions that enforce conservative safety constraints via online action correction in reinforcement learning.
Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production cs.AI · 2026-04-14 · unverdicted · none · ref 67 · internal anchor
PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.
MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents cs.AI · 2026-02-13 · unverdicted · none · ref 79 · internal anchor
MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.
Analyzing Adversarial Inputs in Deep Reinforcement Learning cs.LG · 2024-02-07 · unverdicted · none · ref 16 · internal anchor
Introduces the Adversarial Rate metric and associated tools to systematically evaluate and visualize the impact of adversarial inputs on DRL policies using formal verification.
A Review On Safe Reinforcement Learning Using Lyapunov and Barrier Functions eess.SY · 2025-08-12 · unverdicted · none · ref 129 · internal anchor
A literature review of safe RL using Lyapunov and barrier functions that identifies a shift to model-free methods since 2017, well-defined open problems per approach class, and high-dimensional scalability as the main barrier.

Benchmarking Batch Deep Reinforcement Learning Algorithms

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer