hub Mixed citations

Benchmarking Batch Deep Reinforcement Learning Algorithms

Mixed citation behavior. Most common role is background (67%).

21 Pith papers citing it
abstract

Widely used deep reinforcement learning algorithms have been shown to fail in the batch setting, that is, when learning from a fixed data set without interaction with the environment. Following this result, several papers have reported reasonable performance under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.

citation-role summary

background 4 · dataset 1 · method 1

representative citing papers

Fatigue-Aware Learning to Defer via Constrained Optimisation

cs.LG · 2026-04-01 · unverdicted · novelty 7.0

FALCON incorporates psychologically grounded fatigue curves into learning-to-defer via a CMDP formulation and PPO-Lagrangian optimization, outperforming prior L2D methods and generalizing to unseen fatigue patterns on the new FA-L2D benchmark.

Generative Auto-Bidding with Unified Modeling and Exploration

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

GUIDE integrates a Decision Transformer for joint modeling of bidding actions and states with Q-value regularization for exploration and an IDM for safe policy fallback, outperforming baselines in simulations and real Taobao deployment with gains in GMV, clicks, cost, and ROI.

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.

Shaping Zero-Shot Coordination via State Blocking

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

Why Does Agentic Safety Fail to Generalize Across Tasks?

cs.LG · 2026-05-07 · conditional · novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.

Learned Lyapunov Shielding for Adaptive Control

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Learned Lyapunov functions, residual SAC policies, and PINNs are combined with a Slotine-Li controller and a closed-form safety filter to improve tracking on uncertain Euler-Lagrange systems while retaining stability guarantees.
