Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

Abhishek Gupta; Aravind Rajeswaran; Emanuel Todorov; Giulia Vezzani; John Schulman; Sergey Levine; Vikash Kumar

arxiv: 1709.10087 · v2 · pith:RDM4Z4DZ · submitted 2017-09-28 · cs.LG · cs.AI · cs.RO

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, Sergey Levine

classification cs.LG cs.AI cs.RO

keywords learning complex demonstrations dexterous manipulation sample tasks been

Dexterous multi-fingered hands are extremely versatile and provide a generic way to perform a multitude of tasks in human-centric environments. However, effectively controlling them remains challenging due to their high dimensionality and large number of potential contacts. Deep reinforcement learning (DRL) provides a model-agnostic approach to control complex dynamical systems, but has not been shown to scale to high-dimensional dexterous manipulation. Furthermore, deployment of DRL on physical systems remains challenging due to sample inefficiency. Consequently, the success of DRL in robotics has thus far been limited to simpler manipulators and tasks. In this work, we show that model-free DRL can effectively scale up to complex manipulation tasks with a high-dimensional 24-DoF hand, and solve them from scratch in simulated experiments. Furthermore, with the use of a small number of human demonstrations, the sample complexity can be significantly reduced, which enables learning with sample sizes equivalent to a few hours of robot experience. The use of demonstrations results in policies that exhibit very natural movements and, surprisingly, are also substantially more robust.

Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Offline Reinforcement Learning with Implicit Q-Learning
cs.LG 2021-10 unverdicted novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
cs.AI 2026-06 unverdicted novelty 7.0

SGPO extracts strategies from strong-model responses, builds autonomous and guided trajectories, and applies token-level forward-KL distillation with adaptive weighting to outperform SFT and RL baselines by 2.2 points...
Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fin...
DSSP: Diffusion State Space Policy with Full-History Encoding
cs.RO 2026-05 conditional novelty 7.0

DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size...
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
cs.RO 2026-05 unverdicted novelty 7.0

HDP3 is a pocket-scale 3D diffusion policy with a Diffusion Mixer decoder that achieves state-of-the-art visuomotor control using two-step DDIM inference and under 1% of the parameters of prior 3D diffusion policies.
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
cs.RO 2026-05 conditional novelty 7.0

Frequency analysis of smooth robot actions bounds denoising error to low-frequency modes, enabling a sub-1% parameter 3D diffusion policy with two-step inference that reaches SOTA on manipulation benchmarks.
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
cs.RO 2026-04 unverdicted novelty 7.0

ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...
CoLA-Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 7.0

CoLA-Flow Policy encodes action sequences into a continuous latent space and learns an explicit flow there, yielding near-single-step inference with up to 93.7% smoother trajectories and 25-point higher task success t...
Information Filtering via Variational Regularization for Robot Manipulation
cs.RO 2026-01 unverdicted novelty 7.0

Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld whil...
From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning
cs.LG 2025-11 unverdicted novelty 7.0

DARE performs sample-level constraint relaxation in offline-to-online RL by conditioning on behavioral consistency with a behavior model via posterior-induced exchange, yielding improved fine-tuning stability and perf...
From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning
cs.LG 2025-11 unverdicted novelty 7.0

DARE provides a distribution-aware sample-level constraint release mechanism for offline-to-online RL based on behavioral consistency with a behavior model, supported by theoretical analysis and D4RL experiments showi...
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
cs.RO 2022-09 unverdicted novelty 7.0

VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.
Solving Rubik's Cube with a Robot Hand
cs.LG 2019-10 accept novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
Strategic Bargaining in Multi-Buyer Markets: Reinforcement Learning from Verifiable Rewards for LLM Negotiations
cs.LG 2026-07 conditional novelty 6.0

RLVR training teaches a 30B LLM to strategically explore a multi-buyer market and extract 70% of available surplus, outperforming frontier models up to 1T parameters in concurrent negotiation.
One Demonstration Is Enough for Real-World Robotic Reinforcement Learning
cs.RO 2026-07 unverdicted novelty 6.0

AutoSERL achieves strong performance on six real-world robot manipulation tasks using RL guided by a single demonstration via sliding-window intervention, safety recovery, and automatic termination.
Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization
cs.LG 2026-07 unverdicted novelty 6.0

Active-GRPO reaches 0.1773 average SRxSim on TOMG-Bench MOLOPT by adaptively switching between imitation and self-reinforcement while upgrading references, outperforming GRPO and RePO.
Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization
cs.RO 2026-06 unverdicted novelty 6.0

Adversarial Posture Regularization matches RL policy posture distributions to casual human piano-playing data to enforce human-like kinematics in dexterous hands, outperforming baselines on cPSI, BSE, and FAC metrics.
RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

RARM is a lightweight visual comparator trained once on general videos that supplies dense progress rewards to RL by matching rollout clips to a reference demonstration and gating rewards on match confidence.
FlexPath: Learned Semantic Path Priors for Image-Based Planning
cs.CV 2026-06 unverdicted novelty 6.0

FlexPath decouples learning of task-independent feasible path priors from task-specific adaptation via imitation learning and differentiable Path Shape Objectives for image-based planning.
Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video
cs.RO 2026-06 unverdicted novelty 6.0

Video2Sim2Real turns a single human video into a deployable robot manipulation skill by reconstructing a digital twin, anchoring motions to object-centric simulator configurations, and bridging sim-to-real gaps with i...
Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal
cs.RO 2026-05 unverdicted novelty 6.0

FGO guides diffusion policy generation via expanding spectral bands on sub-frequency manifolds to improve action smoothness on 15 robotic manipulation tasks.
Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Target-Aligned Bellman Backup (TABB) improves cross-domain offline RL by selecting source transitions according to their contribution to accurate target-domain Bellman target estimation.
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
cs.RO 2026-05 unverdicted novelty 6.0

DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing
cs.LG 2026-05 unverdicted novelty 6.0

DD-SRad is a new RL constraint technique that adapts per-actuator radii dynamically to achieve zero violations and unconstrained-level task performance on heterogeneous robotic joints.
Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control
cs.RO 2026-05 conditional novelty 6.0

A multi-agent RL high-level planner outputs task-space velocities that a GPU-parallel QP low-level controller converts to joint velocities while enforcing limits and collisions, yielding robust sim-to-real dexterous g...
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
cs.RO 2026-05 unverdicted novelty 6.0

Hydra-DP3 achieves SOTA visuomotor performance with under 1% of prior 3D diffusion policy parameters by using frequency analysis to justify a lightweight decoder and two-step DDIM inference.
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
cs.RO 2026-05 unverdicted novelty 6.0

Hydra-DP3 is a lightweight 3D diffusion policy that uses frequency analysis of smooth action trajectories to enable two-step DDIM inference and achieves state-of-the-art results with under 1% of prior parameters.
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
One Hand to Rule Them All: Canonical Representations for Unified Dexterous Manipulation
cs.RO 2026-02 unverdicted novelty 6.0

A unified parameter space and canonical URDF enable cross-embodiment dexterous grasping policies with 81.9% zero-shot success on unseen hands like the 3-finger LEAP Hand.
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
cs.AI 2025-09 unverdicted novelty 6.0

TimeRewarder derives step-wise progress rewards from frame-wise temporal distances in passive videos and uses them to guide RL, achieving high success rates on Meta-World tasks with fewer interactions than prior metho...
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
cs.RO 2025-05 conditional novelty 6.0

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
Training Language Models to Self-Correct via Reinforcement Learning
cs.LG 2024-09 unverdicted novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Diffusion Policy Policy Optimization
cs.RO 2024-09 unverdicted novelty 6.0

DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
Proximal Policy Distillation
cs.LG 2024-07 conditional novelty 6.0

PPD integrates PPO into policy distillation so the student collects and uses its own rewards, yielding better sample efficiency and robustness than standard student-distill or teacher-distill on ATARI, Mujoco, and Pro...
R3M: A Universal Visual Representation for Robot Manipulation
cs.RO 2022-03 unverdicted novelty 6.0

A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks fr...
Stage-Transition Dense Reward Modeling for Reinforcement Learning
cs.RO 2026-06 unverdicted novelty 5.0

STDR infers stage structure from expert videos to supply stage-transition and within-stage progress rewards, improving RL sample efficiency on 14 manipulation tasks.
HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads
cs.RO 2026-05 unverdicted novelty 5.0

HOIST finetunes a VLA policy from VR demonstrations then applies iterative batched RL to cut translational placement error by 19.9 cm and angular error by 3.56 degrees versus pure VLA on suspended-load manipulation.
Implicit Action Chunking for Smooth Continuous Control
cs.RO 2026-05 unverdicted novelty 5.0

Dual-Window Smoothing uses an execution window for deterministic smoothness and a value window to correct critic bias, plus a first-order temporal regularizer, to achieve smoother RL control than explicit chunking or ...
FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy
cs.RO 2026-05 unverdicted novelty 5.0

FocalPolicy introduces frequency-optimized chunking and locally anchored flow matching with a foresight composite objective to reduce inter-chunk discontinuities in visuomotor policies.
FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy
cs.RO 2026-05 unverdicted novelty 5.0

FocalPolicy introduces frequency-optimized chunking and locally anchored flow matching with a foresight composite objective to improve inter-chunk coherence in visuomotor policies for manipulation tasks.
X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
cs.RO 2026-05 unverdicted novelty 5.0

X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
cs.LG 2026-05 unverdicted novelty 5.0

Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
CoLA-Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 5.0

CoLA-Flow Policy encodes action sequences into latent trajectories and performs flow matching there, yielding near-single-step inference with up to 93.7% smoother trajectories and 25-point higher success rates than ra...
From monoliths to modules: Decomposing transducers for efficient world modelling
cs.AI 2025-12 unverdicted novelty 5.0

A framework for decomposing transducers into sub-transducers on distinct subspaces to enable parallel and interpretable world models.
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
cs.AI 2025-09 unverdicted novelty 5.0

TimeRewarder derives progress-based dense rewards from passive videos via frame-wise temporal distance modeling and uses them as proxy rewards to boost RL success on Meta-World tasks.
Learning to Solve a Rubik's Cube with a Dexterous Hand
cs.RO 2019-07 unverdicted novelty 5.0

Hierarchical RL combines a model-based cube solver with a model-free hand controller to solve Rubik's cubes in simulation, achieving 90.3% success on 1400 random scrambles.
Reasoning and Generalization in RL: A Tool Use Perspective
cs.NE 2019-07 unverdicted novelty 5.0

Proposes a tool-use inspired framework with multiple test sets to measure specified types of generalization in RL.
EaDex: A Cross-Embodiment Dexterous Manipulation Framework from Low-Cost Demonstrations
cs.RO 2026-06 unverdicted novelty 4.0

EaDex combines single-camera RGB-D capture, MANO retargeting, and dynamic demonstration annealing to achieve 55.3% relative improvement over baseline on nine cross-embodiment dexterous object-opening tasks across three hands.
Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization
cs.RO 2026-05 unverdicted novelty 3.0

HiMAQ applies hierarchical vector quantization to human demonstrations to generate macro actions that yield higher human-likeness scores than flat MAQ on D4RL while matching or exceeding success rates across IQL, SAC,...
On Multi-Agent Learning in Team Sports Games
cs.MA 2019-06 unverdicted novelty 3.0

Describes a hierarchical RL method for multi-agent learning in team sports games aiming for human-like agents, reporting preliminary results that show promise.