Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model

Hao Su; Mengke Zhang; Stone Tao; Tongzhou Mu; Xiu Yuan; Yunhao Fang

arxiv: 2412.13630 · v1 · pith:RIWLAU73new · submitted 2024-12-18 · 💻 cs.RO · cs.AI · cs.LG

Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, Hao Su

classification 💻 cs.RO cs.AI cs.LG

keywords learning, policy, models, imitation, decorator, online, large, policies

Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity, quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks (ManiSkill and Adroit) and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies. See our project page (https://policydecorator.github.io) for videos.
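
The residual-refinement idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out what the "controlled exploration strategies" are, so the two mechanisms below (a residual clipped and scaled by a bound `alpha`, and a probability of applying the residual that ramps up linearly over `warmup_steps`) are assumptions made for illustration, and the function and parameter names are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def decorated_action(base_action, residual_action, alpha, step, warmup_steps, rng):
    """Combine a frozen base policy's action with a small learned correction.

    All names and both control mechanisms here are illustrative assumptions.
    """
    # Assumed control 1: keep the correction small by clipping the raw
    # residual to [-1, 1] and scaling it by the bound alpha.
    bounded = alpha * np.clip(residual_action, -1.0, 1.0)

    # Assumed control 2: apply the residual with a probability that grows
    # linearly from 0 to 1 over the first warmup_steps environment steps.
    eps = min(1.0, step / warmup_steps)

    if rng.random() < eps:
        return base_action + bounded
    # Otherwise fall back to the unmodified base policy action.
    return base_action


rng = np.random.default_rng(0)
base = np.array([0.5, -0.2])      # action from the offline-trained policy
residual = np.array([5.0, -5.0])  # raw output of the residual policy

early = decorated_action(base, residual, alpha=0.1, step=0, warmup_steps=100, rng=rng)
late = decorated_action(base, residual, alpha=0.1, step=100, warmup_steps=100, rng=rng)
```

Because the base policy is only queried for its action, a wrapper of this shape treats it as a black box, which is one way to read "model-agnostic" here. In the sketch, `early` equals the base action and `late` differs from it by at most `alpha` per dimension.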

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning
cs.RO 2026-06 unverdicted novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
cs.RO 2026-06 unverdicted novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-sh...
EXPO: Stable Reinforcement Learning with Expressive Policies
cs.LG 2025-07 conditional novelty 7.0

EXPO stabilizes online RL for expressive policies by training a base policy with imitation and using a lightweight Gaussian edit policy to select higher-value actions on the fly for sampling and TD backups.
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
cs.RO 2025-06 unverdicted novelty 7.0

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
Learning Process Rewards via Success Visitation Matching for Efficient RL
cs.LG 2026-06 unverdicted novelty 6.0

Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic...
MODIP: Efficient Model-Based Optimization for Diffusion Policies
cs.LG 2026-06 unverdicted novelty 6.0

MODIP fine-tunes diffusion policies offline-to-online by training a world model, running MPC with terminal state values inside it to create targets, and using policy-independent TD critics, yielding gains over BC on D...
Flow-based Policy Adaptation without Policy Updates
cs.RO 2026-06 unverdicted novelty 6.0

GLOVES learns flow models from limited expert demonstrations to selectively correct actions from non-expert policies or operators toward expert distributions using reverse-flow OOD detection as an intervention gate.
Closed-Loop Neural Activation Control in Vision-Language-Action Models
cs.AI 2026-05 unverdicted novelty 6.0

CTRL-STEER applies PID or RL-based feedback control to adaptively steer motion-aligned residual directions in VLA models, yielding more stable regulation and better task success on LIBERO benchmarks than fixed steering.
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
cs.RO 2026-05 unverdicted novelty 6.0

Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
cs.RO 2026-05 unverdicted novelty 6.0

Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.
Fisher Decorator: Refining Flow Policy via a Local Transport Map
cs.LG 2026-04 unverdicted novelty 6.0

Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
cs.RO 2026-03 conditional novelty 6.0

ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning
cs.RO 2026-02 unverdicted novelty 6.0

LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...
HiL-ResRL: A Model-Agnostic Finetuning Adapter via Human-in-the-loop Residual Reinforcement Learning
cs.RO 2026-06 unverdicted novelty 5.0

HiL-ResRL trains a model-agnostic residual policy on VLA actions using human-guided online RL, achieving over 95% success rate after 1.5 hours of real-robot training.
Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 5.0

GTP-FA is a grasp-then-plan framework with failure attribution that diagnoses errors to optimize grasping priors and planning data collection, raising success rates across RL, IL, diffusion, and VLA methods in simulat...
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
cs.RO 2026-05 unverdicted novelty 5.0

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.