Benchmarking Batch Deep Reinforcement Learning Algorithms
Mixed citation behavior. Most common role is background (67%).
abstract
Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting, that is, learning from a fixed data set without interaction with the environment. Following this result, several papers have reported reasonable performance across a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.
citing papers explorer
- Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning
  ALaM stabilizes state-wise multiplier networks in safe RL via quadratic penalties and supervised regression on dual targets, guaranteeing multiplier convergence and optimal constrained policies when combined with SAC.
- Improving Feasibility via Fast Autoencoder-Based Projections
  An adversarially trained autoencoder learns a convex latent space to enable rapid approximate projections that enforce nonconvex constraints in optimization and reinforcement learning.
- Fatigue-Aware Learning to Defer via Constrained Optimisation
  FALCON incorporates psychologically grounded fatigue curves into learning-to-defer via a CMDP formulation and PPO-Lagrangian optimization, outperforming prior L2D methods and generalizing to unseen fatigue patterns on the new FA-L2D benchmark.
- Generative Auto-Bidding with Unified Modeling and Exploration
  GUIDE integrates a Decision Transformer for joint modeling of bidding actions and states with Q-value regularization for exploration and an IDM for safe policy fallback, outperforming baselines in simulations and real Taobao deployment with gains in GMV, clicks, cost, and ROI.
- Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability
  Action-conditioned near-term risk prediction gates optimistic and conservative value estimates in RL to approximate risk-sensitive POMDP control, yielding better safety-performance tradeoffs with lower runtime than belief planning baselines.
- Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation
  CVaR-constrained TD3 policies for robot navigation show larger safety margins and higher post-training reachability verification rates than average-cost baselines across simulated scenarios and real-robot tests.
- Optimal design of solar-battery hybrid resources considering multi-market participation under weather and price uncertainty
  A deep reinforcement learning co-optimization framework is developed for jointly sizing solar-battery hybrids and determining their multi-market bidding strategies under stochastic weather and price conditions.
- Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
  Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.
- Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
  An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.
- Shaping Zero-Shot Coordination via State Blocking
  SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
- Why Does Agentic Safety Fail to Generalize Across Tasks?
  Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
- Learned Lyapunov Shielding for Adaptive Control
  Learned Lyapunov functions, residual SAC policies, and PINNs are combined with a Slotine-Li controller and a closed-form safety filter to improve tracking on uncertain Euler-Lagrange systems while retaining stability guarantees.
- When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
  Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
- CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
  CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.
- Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
  A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.
- Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics
  SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.
- CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning
  CAPSULE learns probabilistic control-affine dynamics offline to construct uncertainty-incorporating control barrier functions that enforce conservative safety constraints via online action correction in reinforcement learning.
- Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
  PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.
- MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents
  MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.
- Analyzing Adversarial Inputs in Deep Reinforcement Learning
  Introduces the Adversarial Rate metric and associated tools to systematically evaluate and visualize the impact of adversarial inputs on DRL policies using formal verification.
- A Review On Safe Reinforcement Learning Using Lyapunov and Barrier Functions
  A literature review of safe RL using Lyapunov and barrier functions that identifies a shift to model-free methods since 2017, well-defined open problems per approach class, and high-dimensional scalability as the main barrier.