Reinforcement Learning from Imperfect Demonstrations

Fisher Yu; Huazhe Xu; Ji Lin; Sergey Levine; Trevor Darrell; Yang Gao

arXiv: 1802.05313 · v2 · pith:VXM2BKV7new · submitted 2018-02-14 · cs.AI · cs.LG · stat.ML

Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, Trevor Darrell

classification cs.AI cs.LG stat.ML

keywords learning demonstration reinforcement demonstrations data environment algorithm approaches

Robust real-world learning should benefit from both demonstrations and interactions with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on the reward received from the environment. These tasks have divergent losses which are difficult to jointly optimize and such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstrations and refines the policy in the environment, surpassing the demonstrator's performance. Crucially, both learning from demonstration and interactive refinement use the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.
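
The abstract says that NAC "normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data" but gives no equations. The sketch below illustrates one way such a normalization can behave. It assumes a maximum-entropy (soft) formulation in which V(s) = alpha * log sum_a exp(Q(s,a) / alpha) and the policy is exp((Q - V) / alpha). The function names, the temperature `alpha`, the fixed step size, and the three-action toy state are all hypothetical choices made for illustration, and the update omits any reward or target term. It is not the authors' implementation.

```python
import numpy as np

def soft_value(q, alpha=1.0):
    """Assumed soft value: alpha * log sum_a exp(Q/alpha), kept with a trailing axis of size 1."""
    z = np.asarray(q, dtype=float) / alpha
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return alpha * (m + np.log(np.exp(z - m).sum(axis=-1, keepdims=True)))

def soft_policy(q, alpha=1.0):
    """Assumed policy: pi(a|s) = exp((Q(s,a) - V(s)) / alpha)."""
    q = np.asarray(q, dtype=float)
    return np.exp((q - soft_value(q, alpha)) / alpha)

def normalized_grad(q, action, alpha=1.0):
    """Gradient of Q(s, action) - V(s) with respect to the Q-values of a single state.

    The -V(s) term contributes -pi(b|s) to every action b, so raising the
    demonstrated action's Q-value pushes the other actions' Q-values down.
    """
    grad = -soft_policy(q, alpha)
    grad[action] += 1.0
    return grad

# Toy example: one state, three actions, only action 0 appears in the demonstrations.
q = np.zeros(3)
demo_action = 0
for _ in range(50):
    q += 0.5 * normalized_grad(q, demo_action)  # fixed positive weight, for illustration only

print("Q-values:", q)
print("policy:  ", soft_policy(q))
```

In this toy setting the Q-value of the demonstrated action rises while the Q-values of the two unseen actions fall, because the gradient of the normalizer sums to one across actions. Dropping the `-soft_policy(...)` term would leave the unseen actions' Q-values unchanged.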

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Decision Transformer: Reinforcement Learning via Sequence Modeling
cs.LG 2021-06 accept novelty 8.0
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets
cs.LG 2020-06 unverdicted novelty 6.0
AWAC combines offline data with online RL via advantage-weighted actor-critic updates to enable faster acquisition of robotic skills such as dexterous manipulation.

Attentive Multi-Task Deep Reinforcement Learning
cs.LG 2019-07 unverdicted novelty 6.0
Attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL to enable positive transfer and avoid negative transfer, matching or exceeding prior methods with fewer parameters.

When a Robot is More Capable than a Human: Learning from Constrained Demonstrators
cs.RO 2025-10 unverdicted novelty 5.0
Robots outperform constrained human demonstrations by inferring state-only rewards from demos and using temporal interpolation to label and explore better trajectories, achieving 10x faster task completion on a real r...