Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns
read the original abstract
We introduce a sequence-conditioned critic for Soft Actor-Critic (SAC) that models trajectory context with a lightweight Transformer and trains on aggregated $N$-step targets. Unlike prior approaches that (i) score state-action pairs in isolation or (ii) rely on actor-side action chunking to handle long horizons, our method strengthens the critic itself by conditioning on short trajectory segments and integrating multi-step returns -- without importance sampling (IS). The resulting sequence-aware value estimates capture the critical temporal structure for extended-horizon and sparse-reward problems. On local-motion benchmarks, we further show that freezing critic parameters for several steps makes our update compatible with CrossQ's core idea, enabling stable training \emph{without} a target network. Despite its simplicity -- a 2-layer Transformer with 128-256 hidden units and a maximum update-to-data ratio (UTD) of $1$ -- the approach consistently outperforms standard SAC and strong off-policy baselines, with particularly large gains on long-trajectory control. These results highlight the value of sequence modeling and $N$-step bootstrapping on the critic side for long-horizon reinforcement learning.
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
-
Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning
Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.
-
Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers
WHOLE-MoMa improves whole-body mobile manipulation by applying offline RL with Q-chunking to demonstrations from randomized sub-optimal controllers, outperforming baselines and transferring to real robots without tele...
-
Reinforcement Learning with Action Chunking
Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.