Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

Mel Vecerik , Todd Hester , Jonathan Scholz , Fumin Wang , Olivier Pietquin , Bilal Piot , Nicolas Heess , Thomas Roth\"orl

show 2 more authors

Thomas Lampe Martin Riedmiller

Authors on Pith no claims yet

classification 💻 cs.AI

keywords demonstrationsrewardsroboticsddpgengineeredcarefullydeeplearning

0 comments

read the original abstract

We propose a general and model-free approach for Reinforcement Learning (RL) on real robotics with sparse rewards. We build upon the Deep Deterministic Policy Gradient (DDPG) algorithm to use demonstrations. Both demonstrations and actual interactions are used to fill a replay buffer and the sampling ratio between demonstrations and transitions is automatically tuned via a prioritized replay mechanism. Typically, carefully engineered shaping rewards are required to enable the agents to efficiently explore on high dimensional control problems such as robotics. They are also required for model-based acceleration methods relying on local solvers such as iLQG (e.g. Guided Policy Search and Normalized Advantage Function). The demonstrations replace the need for carefully engineered rewards, and reduce the exploration problem encountered by classical RL approaches in these domains. Demonstrations are collected by a robot kinesthetically force-controlled by a human demonstrator. Results on four simulated insertion tasks show that DDPG from demonstrations out-performs DDPG, and does not require engineered rewards. Finally, we demonstrate the method on a real robotics task consisting of inserting a clip (flexible object) into a rigid object.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG 2026-05 unverdicted novelty 7.0

SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
Stable GFlowNets with Probabilistic Guarantees
cs.LG 2026-05 unverdicted novelty 7.0

Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.
ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization
cs.LG 2026-05 unverdicted novelty 6.0

ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in RL.
Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing
cs.LG 2026-05 unverdicted novelty 6.0

DD-SRad is a new RL constraint technique that adapts per-actuator radii dynamically to achieve zero violations and unconstrained-level task performance on heterogeneous robotic joints.
PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
cs.LG 2026-04 unverdicted novelty 6.0

PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
cs.LG 2026-05 unverdicted novelty 5.0

Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies
cs.LG 2026-05 unverdicted novelty 5.0

XQCfD accelerates actor-critic RL by using prior data, pretrained policies, and stationary architectures to achieve state-of-the-art results on Adroit, Robomimic, and MimicGen manipulation benchmarks with low update-t...
Soft Deterministic Policy Gradient with Gaussian Smoothing
cs.LG 2026-05 unverdicted novelty 5.0

Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discr...
Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
cs.LG 2026-04 unverdicted novelty 5.0

VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.