Recognition: unknown
Reinforcement Learning with Action Chunking
read the original abstract
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
This paper has not been read by Pith yet.
Forward citations
Cited by 8 Pith papers
-
Aligning Flow Map Policies with Optimal Q-Guidance
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
-
Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients
The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
GSDrive: Reinforcing Driving Policies by Multi-mode Trajectory Probing with 3D Gaussian Splatting Environment
GSDrive improves end-to-end driving policies through 3D Gaussian Splatting simulation and multi-mode trajectory probing that supplies dense, differentiable rewards for reinforcement learning.
-
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
-
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.