CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
hub Mixed citations
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Mixed citation behavior. Most common role is background (67%).
abstract
Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies -- a state-of-the-art BC methodology -- we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.
A2A flow matching starts action generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.
Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
Diff-CAST replaces GAN discriminators with diffusion-based priors and adds symmetric command conditioning plus constrained RL to enable versatile, drift-free, and hardware-safe quadruped locomotion.
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.
OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.
LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist policies.
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
MATT-Diff uses a diffusion model with vision transformer and attention to generate multimodal actions for active multi-target tracking from expert planner demonstrations.
FORCE is a 3-stage RL fine-tuning method for VLA models that stabilizes Q-function via on-policy warm-up and filters high-value actions for updates, claiming 79% success rate gains and 32.5% faster training without human intervention.
LP-DS improves generative policies for imitation and RL by optimizing latent noise perturbations with a constrained Lagrangian objective, showing up to 25% better returns on manipulation and locomotion tasks.
SDPG is a new on-policy visual RL algorithm that estimates gradients via stochastic perturbations of rollouts, achieving faster training and lower memory use than baselines on visual MuJoCo tasks while adding new robotics benchmarks and sim-to-real results.
EXPO-FT enables pretrained VLA policies to reach 30/30 success on complex manipulation tasks using an average of 19.1 minutes of online robot data while outperforming prior RL approaches.
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
citing papers explorer
-
Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance
CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
-
What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
-
PlayWorld: Learning Robot World Models from Autonomous Play
PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.
-
Action-to-Action Flow Matching
A2A flow matching starts action generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.
-
WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
-
Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion
Diff-CAST replaces GAN discriminators with diffusion-based priors and adds symmetric command conditioning plus constrained RL to enable versatile, drift-free, and hardware-safe quadruped locomotion.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.
-
Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist policies.
-
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
-
What Does Flow Matching Bring To TD Learning?
Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
-
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
-
FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation
FORCE is a 3-stage RL fine-tuning method for VLA models that stabilizes Q-function via on-policy warm-up and filters high-value actions for updates, claiming 79% success rate gains and 32.5% faster training without human intervention.
-
Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies
LP-DS improves generative policies for imitation and RL by optimizing latent noise perturbations with a constrained Lagrangian objective, showing up to 25% better returns on manipulation and locomotion tasks.
-
Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient
SDPG is a new on-policy visual RL algorithm that estimates gradients via stochastic perturbations of rollouts, achieving faster training and lower memory use than baselines on visual MuJoCo tasks while adding new robotics benchmarks and sim-to-real results.
-
EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models
EXPO-FT enables pretrained VLA policies to reach 30/30 success on complex manipulation tasks using an average of 19.1 minutes of online robot data while outperforming prior RL approaches.
-
Towards Robotic Dexterous Hand Intelligence: A Survey
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
- You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector