Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

classification cs.LG cs.AI

keywords actions, actor-critic, dataset, offline, training, action, behavior, better

Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions. However, it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyze this limitation by investigating the convergence properties of BC-regularized actor-critic optimization and verify it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use, plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art results when simply combined with basic TD3+BC.
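
The ceiling and the replacement idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the quadratic toy action value, the TD3+BC-style objective `lam * Q(a) - (a - a_data)^2`, the critic ensemble used as an uncertainty proxy, and the acceptance rule (higher mean value and ensemble standard deviation below `max_std`) are all assumptions made for illustration, and every function name here is hypothetical.

```python
import numpy as np


def q_value(a, a_star=1.0):
    """Toy bandit action value, peaked at a_star (illustrative assumption)."""
    return -(a - a_star) ** 2


def bc_regularized_optimum(a_data, lam, a_star=1.0):
    """Maximizer of lam * q_value(a) - (a - a_data)**2.

    Setting the derivative to zero gives (lam * a_star + a_data) / (lam + 1),
    which lies strictly between a_data and a_star for any finite lam > 0.
    """
    return (lam * a_star + a_data) / (lam + 1.0)


def replace_actions(a_data, a_target, q_ensemble, max_std):
    """Swap in target-policy actions only where they look better and trusted.

    Assumed rule (not taken from the paper): accept a candidate when the
    ensemble-mean value improves and the ensemble spread at the candidate
    stays within max_std.
    """
    q_data = np.stack([q(a_data) for q in q_ensemble])    # (n_critics, batch)
    q_cand = np.stack([q(a_target) for q in q_ensemble])  # (n_critics, batch)
    better = q_cand.mean(axis=0) > q_data.mean(axis=0)
    trusted = q_cand.std(axis=0) <= max_std
    return np.where(better & trusted, a_target, a_data)


# Hypothetical ensemble: critics that disagree slightly about the peak.
critics = [lambda a, c=c: -(a - c) ** 2 for c in (0.9, 1.0, 1.1)]

a_data = np.array([0.0, 1.0, -3.0])    # dataset actions
a_target = np.array([0.5, 0.5, 4.0])   # actions proposed by a target policy

new_targets = replace_actions(a_data, a_target, critics, max_std=0.2)
print("BC targets after replacement:", new_targets)
print("Ceiling with suboptimal data:", bc_regularized_optimum(0.0, lam=1.0))
print("Ceiling after replacement:   ", bc_regularized_optimum(0.5, lam=1.0))
```

In this toy setting, imitating a suboptimal dataset action keeps the regularized optimum short of the value-maximizing action, and moving the imitation target closer to the peak raises that optimum. The third sample shows the uncertainty bound at work: the candidate has a higher mean value, but the critics disagree too much about it, so the dataset action is kept.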

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Refining Compositional Diffusion for Reliable Long-Horizon Planning
cs.RO 2026-05 unverdicted novelty 6.0

RCD steers compositional diffusion sampling toward high-density coherent plans by combining reconstruction-error guidance with overlap consistency, outperforming prior methods on locomotion, manipulation, and pixel-ba...