hub Baseline reference

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, Sergey Levine · 2017 · cs.LG · arXiv 1709.10087

Baseline reference. 50% of citing Pith papers use this work as a benchmark or comparison.

33 Pith papers citing it

abstract

Dexterous multi-fingered hands are extremely versatile and provide a generic way to perform a multitude of tasks in human-centric environments. However, effectively controlling them remains challenging due to their high dimensionality and large number of potential contacts. Deep reinforcement learning (DRL) provides a model-agnostic approach to control complex dynamical systems, but has not been shown to scale to high-dimensional dexterous manipulation. Furthermore, deployment of DRL on physical systems remains challenging due to sample inefficiency. Consequently, the success of DRL in robotics has thus far been limited to simpler manipulators and tasks. In this work, we show that model-free DRL can effectively scale up to complex manipulation tasks with a high-dimensional 24-DoF hand, and solve them from scratch in simulated experiments. Furthermore, with the use of a small number of human demonstrations, the sample complexity can be significantly reduced, which enables learning with sample sizes equivalent to a few hours of robot experience. The use of demonstrations results in policies that exhibit very natural movements and, surprisingly, are also substantially more robust.

citation-role summary

background: 3 · dataset: 3

citation-polarity summary

use dataset: 3 · background: 2 · unclear: 1

representative citing papers

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Offline Reinforcement Learning with Implicit Q-Learning

cs.LG · 2021-10-12 · unverdicted · novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

DSSP: Diffusion State Space Policy with Full-History Encoding

cs.RO · 2026-05-14 · conditional · novelty 7.0

DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

CoLA-Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation

cs.RO · 2026-01-30 · unverdicted · novelty 7.0 · 2 refs

CoLA-Flow Policy encodes action sequences into a continuous latent space and learns an explicit flow there, yielding near-single-step inference with up to 93.7% smoother trajectories and 25-point higher task success than raw-action flow baselines.

Information Filtering via Variational Regularization for Robot Manipulation

cs.RO · 2026-01-29 · unverdicted · novelty 7.0

Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.

From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning

cs.LG · 2025-11-05 · unverdicted · novelty 7.0 · 2 refs

DARE performs sample-level constraint relaxation in offline-to-online RL by conditioning on behavioral consistency with a behavior model via posterior-induced exchange, yielding improved fine-tuning stability and performance on D4RL benchmarks.

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

cs.RO · 2022-09-30 · unverdicted · novelty 7.0

VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

Solving Rubik's Cube with a Robot Hand

cs.LG · 2019-10-16 · accept · novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

cs.RO · 2026-04-13 · unverdicted · novelty 7.0

ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.

Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving higher success rates in simulated and real tasks.

Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Target-Aligned Bellman Backup (TABB) improves cross-domain offline RL by selecting source transitions according to their contribution to accurate target-domain Bellman target estimation.

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

cs.RO · 2026-05-18 · unverdicted · novelty 6.0

DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.

One Hand to Rule Them All: Canonical Representations for Unified Dexterous Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 6.0

A unified parameter space and canonical URDF enable cross-embodiment dexterous grasping policies with 81.9% zero-shot success on unseen hands like the 3-finger LEAP Hand.

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

cs.AI · 2025-09-30 · unverdicted · novelty 6.0

TimeRewarder derives step-wise progress rewards from frame-wise temporal distances in passive videos and uses them to guide RL, achieving high success rates on Meta-World tasks with fewer interactions than prior methods or hand-designed rewards.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

cs.RO · 2025-05-24 · conditional · novelty 6.0

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Diffusion Policy Policy Optimization

cs.RO · 2024-09-01 · unverdicted · novelty 6.0

DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.

Proximal Policy Distillation

cs.LG · 2024-07-21 · conditional · novelty 6.0

PPD integrates PPO into policy distillation so the student collects and uses its own rewards, yielding better sample efficiency and robustness than standard student-distill or teacher-distill on ATARI, Mujoco, and Procgen tasks.

R3M: A Universal Visual Representation for Robot Manipulation

cs.RO · 2022-03-23 · unverdicted · novelty 6.0

A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks from 20 demonstrations.

Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

DD-SRad is a new RL constraint technique that adapts per-actuator radii dynamically to achieve zero violations and unconstrained-level task performance on heterogeneous robotic joints.

Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control

cs.RO · 2026-05-05 · conditional · novelty 6.0

A multi-agent RL high-level planner outputs task-space velocities that a GPU-parallel QP low-level controller converts to joint velocities while enforcing limits and collisions, yielding robust sim-to-real dexterous grasping with zero-shot steerability.

VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robotic policies.

citing papers explorer

Showing 33 of 33 citing papers.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 270 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Offline Reinforcement Learning with Implicit Q-Learning cs.LG · 2021-10-12 · unverdicted · none · ref 11
IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
DSSP: Diffusion State Space Policy with Full-History Encoding cs.RO · 2026-05-14 · conditional · none · ref 45 · internal anchor
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling cs.LG · 2026-05-14 · unverdicted · none · ref 193 · internal anchor
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
CoLA-Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation cs.RO · 2026-01-30 · unverdicted · none · ref 33 · 2 links · internal anchor
CoLA-Flow Policy encodes action sequences into a continuous latent space and learns an explicit flow there, yielding near-single-step inference with up to 93.7% smoother trajectories and 25-point higher task success than raw-action flow baselines.
Information Filtering via Variational Regularization for Robot Manipulation cs.RO · 2026-01-29 · unverdicted · none · ref 12 · internal anchor
Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.
From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning cs.LG · 2025-11-05 · unverdicted · none · ref 10 · 2 links · internal anchor
DARE performs sample-level constraint relaxation in offline-to-online RL by conditioning on behavioral consistency with a behavior model via posterior-induced exchange, yielding improved fine-tuning stability and performance on D4RL benchmarks.
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training cs.RO · 2022-09-30 · unverdicted · none · ref 24 · internal anchor
VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.
Solving Rubik's Cube with a Robot Hand cs.LG · 2019-10-16 · accept · none · ref 86 · internal anchor
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
Learning Agentic Policy from Action Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 46
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching cs.RO · 2026-04-13 · unverdicted · none · ref 16
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.
Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation cs.RO · 2026-04-07 · unverdicted · none · ref 29
ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving higher success rates in simulated and real tasks.
Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning cs.LG · 2026-05-21 · unverdicted · none · ref 32 · internal anchor
Target-Aligned Bellman Backup (TABB) improves cross-domain offline RL by selecting source transitions according to their contribution to accurate target-domain Bellman target estimation.
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System cs.RO · 2026-05-18 · unverdicted · none · ref 46 · internal anchor
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
One Hand to Rule Them All: Canonical Representations for Unified Dexterous Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 20 · internal anchor
A unified parameter space and canonical URDF enable cross-embodiment dexterous grasping policies with 81.9% zero-shot success on unseen hands like the 3-finger LEAP Hand.
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance cs.AI · 2025-09-30 · unverdicted · none · ref 9 · internal anchor
TimeRewarder derives step-wise progress rewards from frame-wise temporal distances in passive videos and uses them to guide RL, achieving high success rates on Meta-World tasks with fewer interactions than prior methods or hand-designed rewards.
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning cs.RO · 2025-05-24 · conditional · none · ref 60 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 211 · internal anchor
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Diffusion Policy Policy Optimization cs.RO · 2024-09-01 · unverdicted · none · ref 76 · internal anchor
DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
Proximal Policy Distillation cs.LG · 2024-07-21 · conditional · none · ref 10 · internal anchor
PPD integrates PPO into policy distillation so the student collects and uses its own rewards, yielding better sample efficiency and robustness than standard student-distill or teacher-distill on ATARI, Mujoco, and Procgen tasks.
R3M: A Universal Visual Representation for Robot Manipulation cs.RO · 2022-03-23 · unverdicted · none · ref 20 · internal anchor
A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks from 20 demonstrations.
Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing cs.LG · 2026-05-05 · unverdicted · none · ref 21
DD-SRad is a new RL constraint technique that adapts per-actuator radii dynamically to achieve zero violations and unconstrained-level task performance on heterogeneous robotic joints.
Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control cs.RO · 2026-05-05 · conditional · none · ref 33
A multi-agent RL high-level planner outputs task-space velocities that a GPU-parallel QP low-level controller converts to joint velocities while enforcing limits and collisions, yielding robust sim-to-real dexterous grasping with zero-shot steerability.
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation cs.RO · 2026-04-17 · unverdicted · none · ref 25
VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robotic policies.
Implicit Action Chunking for Smooth Continuous Control cs.RO · 2026-05-19 · unverdicted · none · ref 62 · internal anchor
Dual-Window Smoothing uses an execution window for deterministic smoothness and a value window to correct critic bias, plus a first-order temporal regularizer, to achieve smoother RL control than explicit chunking or standard baselines.
FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy cs.RO · 2026-05-15 · unverdicted · none · ref 1 · 2 links · internal anchor
FocalPolicy introduces frequency-optimized chunking and locally anchored flow matching with a foresight composite objective to reduce inter-chunk discontinuities in visuomotor policies.
From monoliths to modules: Decomposing transducers for efficient world modelling cs.AI · 2025-12-01 · unverdicted · none · ref 10 · internal anchor
A framework for decomposing transducers into sub-transducers on distinct subspaces to enable parallel and interpretable world models.
Learning to Solve a Rubik's Cube with a Dexterous Hand cs.RO · 2019-07-26 · unverdicted · none · ref 4 · internal anchor
Hierarchical RL combines a model-based cube solver with a model-free hand controller to solve Rubik's cubes in simulation, achieving 90.3% success on 1400 random scrambles.
Reasoning and Generalization in RL: A Tool Use Perspective cs.NE · 2019-07-03 · unverdicted · none · ref 23 · internal anchor
Proposes a tool-use inspired framework with multiple test sets to measure specified types of generalization in RL.
X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction cs.RO · 2026-05-12 · unverdicted · none · ref 48
X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies cs.LG · 2026-05-12 · unverdicted · none · ref 70
Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
On Multi-Agent Learning in Team Sports Games cs.MA · 2019-06-25 · unverdicted · none · ref 11 · internal anchor
Describes a hierarchical RL method for multi-agent learning in team sports games aiming for human-like agents, reporting preliminary results that show promise.
Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control cs.RO · 2026-05-02 · unreviewed · ref 18 · 3 links
