DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Advances in neural information processing systems , volume=
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6roles
background 1polarities
background 1representative citing papers
TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without full RL solves per iteration, outperforming prior imitation methods by 2.4x aggregate IQM and recovering generalizable rewards.
SeqRejectron constructs a stopping rule with a small set of validator policies to achieve horizon-free sample complexity for selective imitation learning under arbitrary dynamics shifts.
Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
citing papers explorer
-
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
-
Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without full RL solves per iteration, outperforming prior imitation methods by 2.4x aggregate IQM and recovering generalizable rewards.
-
Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift
SeqRejectron constructs a stopping rule with a small set of validator policies to achieve horizon-free sample complexity for selective imitation learning under arbitrary dynamics shifts.
-
Learning the Preferences of a Learning Agent
Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
- REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak