DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.
Off-policy deep reinforcement learning without exploration
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.LG 6roles
background 1polarities
background 1representative citing papers
Target-Aligned Bellman Backup (TABB) improves cross-domain offline RL by selecting source transitions according to their contribution to accurate target-domain Bellman target estimation.
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
VGM²P achieves SOTA-comparable performance in offline MARL via value-guided conditional behavior cloning with MeanFlow, enabling efficient single-step action generation insensitive to regularization coefficients.
Weighted BC estimates trajectory density ratios from a clean reference set via binary discrimination and reweights the BC loss to converge to the clean expert policy with finite-sample bounds independent of contamination rate.
Robust minimax task inference in BFMs achieves dynamics-shift robustness from nominal offline data alone and outperforms standard baselines.
citing papers explorer
-
DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.
-
Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning
Target-Aligned Bellman Backup (TABB) improves cross-domain offline RL by selecting source transitions according to their contribution to accurate target-domain Bellman target estimation.
-
Shaping Zero-Shot Coordination via State Blocking
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
-
Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning
VGM²P achieves SOTA-comparable performance in offline MARL via value-guided conditional behavior cloning with MeanFlow, enabling efficient single-step action generation insensitive to regularization coefficients.
-
Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets
Weighted BC estimates trajectory density ratios from a clean reference set via binary discrimination and reweights the BC loss to converge to the clean expert policy with finite-sample bounds independent of contamination rate.
-
When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited
Robust minimax task inference in BFMs achieves dynamics-shift robustness from nominal offline data alone and outperforms standard baselines.