Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.
Efficient online reinforcement learning fine-tuning need not retain offline data.arXiv preprint arXiv:2412.07762
8 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.
Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.
HandelBot refines simulation policies via physical rollouts and residual RL to achieve precise bimanual piano playing, outperforming direct sim transfer by 1.8x with only 30 minutes of real data across five songs.
citing papers explorer
-
Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation
Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.
-
WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
-
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.
-
Reinforcement Learning with Action Chunking
Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.
-
COOPO: Cyclic Offline-Online Policy Optimization Algorithm
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.
-
HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies
HandelBot refines simulation policies via physical rollouts and residual RL to achieve precise bimanual piano playing, outperforming direct sim transfer by 1.8x with only 30 minutes of real data across five songs.