Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns

Dong Tian; Gerhard Neumann; Onur Celik

arxiv: 2503.03660 · v4 · pith:ZLUVBY7Xnew · submitted 2025-03-05 · 💻 cs.LG


classification 💻 cs.LG

keywords critic · actor-critic · chunking · returns · soft · step · trajectory · transformer



We introduce a sequence-conditioned critic for Soft Actor-Critic (SAC) that models trajectory context with a lightweight Transformer and trains on aggregated $N$-step targets. Unlike prior approaches that (i) score state-action pairs in isolation or (ii) rely on actor-side action chunking to handle long horizons, our method strengthens the critic itself by conditioning on short trajectory segments and integrating multi-step returns -- without importance sampling (IS). The resulting sequence-aware value estimates capture the critical temporal structure for extended-horizon and sparse-reward problems. On local-motion benchmarks, we further show that freezing critic parameters for several steps makes our update compatible with CrossQ's core idea, enabling stable training without a target network. Despite its simplicity -- a 2-layer Transformer with 128-256 hidden units and a maximum update-to-data ratio (UTD) of $1$ -- the approach consistently outperforms standard SAC and strong off-policy baselines, with particularly large gains on long-trajectory control. These results highlight the value of sequence modeling and $N$-step bootstrapping on the critic side for long-horizon reinforcement learning.
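To make the $N$-step targets mentioned in the abstract concrete, here is a minimal sketch of how uncorrected $n$-step bootstrapped returns can be computed along a short trajectory segment. It is not the authors' implementation: the function names `n_step_targets` and `aggregate_targets` are hypothetical, the bootstrap values are assumed to come from some critic (and, for SAC, to already include the entropy term), and averaging over $n$ is only one plausible reading of "aggregated".

```python
def n_step_targets(rewards, bootstrap_values, gamma):
    """Return [G_1, ..., G_N] for a segment of N rewards.

    rewards[k]          : reward received k steps after the segment start
    bootstrap_values[k] : value estimate for the state reached after k+1 steps
                          (assumed to be supplied by a critic)
    G_n = sum_{k<n} gamma^k * rewards[k] + gamma^n * bootstrap_values[n-1]

    No importance-sampling correction is applied.
    """
    targets = []
    discounted_sum = 0.0  # running sum of discounted rewards
    discount = 1.0        # gamma^k for the current step k
    for reward, value in zip(rewards, bootstrap_values):
        discounted_sum += discount * reward
        discount *= gamma  # now gamma^(k+1), the bootstrap discount
        targets.append(discounted_sum + discount * value)
    return targets


def aggregate_targets(targets):
    # Assumption: a plain mean over n = 1..N; the paper may aggregate differently.
    return sum(targets) / len(targets)


rewards = [1.0, 1.0, 1.0]
bootstrap_values = [0.0, 0.0, 10.0]
targets = n_step_targets(rewards, bootstrap_values, gamma=0.5)
print(targets)                     # [1.0, 1.5, 3.0]
print(aggregate_targets(targets))
```

In this toy segment only the 3-step target sees the large bootstrap value, which illustrates why longer returns can propagate sparse rewards faster than a 1-step target.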


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.

Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers
cs.RO 2026-04 unverdicted novelty 6.0

WHOLE-MoMa improves whole-body mobile manipulation by applying offline RL with Q-chunking to demonstrations from randomized sub-optimal controllers, outperforming baselines and transferring to real robots without tele...

Reinforcement Learning with Action Chunking
cs.LG 2025-07 unverdicted novelty 6.0

Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.