arxiv: 2602.07441 · v2 · pith:LLYFYPQ7new · submitted 2026-02-07 · 💻 cs.LG · cs.AI

Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

Pith reviewed 2026-05-16 06:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords offline reinforcement learning · behavior cloning · actor-critic · proximal action replacement · TD3+BC · policy improvement · action substitution

The pith

Proximal action replacement overcomes the imitation ceiling in BC-regularized actor-critic by substituting suboptimal dataset actions with value-guided improvements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline RL often combines behavior cloning with actor-critic methods to avoid out-of-distribution actions, but this creates a performance ceiling when the fixed dataset contains suboptimal actions: the policy cannot deviate much from them even if the value function indicates better choices. The paper formally analyzes this limitation through convergence properties of the regularized optimization and verifies it experimentally on a simple continuous bandit task. To address the issue, the paper introduces proximal action replacement, a method that replaces the actions in training samples with better ones generated from a stable target policy. These replacements follow the local ascent direction of the action-value function and are bounded by value uncertainty estimates to maintain stability. Experiments show that adding this replacement to basic TD3+BC consistently improves results on standard benchmarks and approaches state-of-the-art performance.

Core claim

The paper establishes that indiscriminate behavior cloning imposes a structural limit on actor-critic learning in offline settings when dataset actions are suboptimal, preventing full exploitation of value function suggestions. By replacing those actions with proximal actions from the target policy guided by Q-function ascent and uncertainty bounds, the optimization can escape this ceiling while preserving training stability.

What carries the argument

Proximal Action Replacement (PAR), which substitutes dataset actions in training samples with improved actions sampled from the target policy under local maximization of the action-value function, constrained by value uncertainty to avoid instability.
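
The replacement step described above can be sketched in a few lines. This is a minimal one-dimensional illustration, not the paper's rule, which the summary does not spell out. It assumes the candidate comes from a single ascent step on the critic starting at the dataset action, and that the candidate is kept only when the estimated gain exceeds a multiple of the uncertainty. The names `proximal_replace`, `step`, `beta`, and the callables `q` and `sigma` are all hypothetical.

```python
def proximal_replace(a_data, q, sigma, step=0.1, beta=1.0, eps=1e-4):
    """Hypothetical 1-D proximal replacement (an assumption, not the paper's rule).

    a_data : dataset action (float)
    q      : callable a -> estimated action value at a fixed state
    sigma  : callable a -> value-uncertainty estimate at a
    """
    # Local ascent direction of q, estimated by central finite differences.
    grad = (q(a_data + eps) - q(a_data - eps)) / (2 * eps)
    # One small step keeps the candidate close to the dataset action.
    candidate = a_data + step * grad
    # Accept only if the estimated gain outweighs the uncertainty.
    gain = q(candidate) - q(a_data)
    if gain > beta * sigma(candidate):
        return candidate
    return a_data


# Toy critic whose maximum sits at a = 1.0 (illustrative only).
toy_q = lambda a: -(a - 1.0) ** 2
confident = proximal_replace(0.0, toy_q, sigma=lambda a: 0.0)  # moves toward 1.0
uncertain = proximal_replace(0.0, toy_q, sigma=lambda a: 1.0)  # keeps 0.0
```

With zero uncertainty the dataset action is nudged toward the critic's maximum; with large uncertainty the original action is kept, which is the stabilizing behavior the summary attributes to the bound.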

If this is right

PAR boosts performance when added to TD3+BC across multiple offline RL benchmarks.
PAR is compatible with different BC regularization approaches.
PAR allows the actor to exploit better actions suggested by the value function without destabilizing training.
The method approaches state-of-the-art results using only the basic TD3+BC base algorithm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

PAR might generalize to other offline RL algorithms beyond actor-critic methods.
Implementing PAR could simplify the design of offline RL systems by reducing reliance on sophisticated regularization techniques.
Further tests in high-dimensional or discrete action spaces could reveal additional benefits or limitations of the replacement strategy.

Load-bearing premise

The target policy remains stable enough that its generated actions can replace dataset actions without causing training divergence or introducing harmful bias.

What would settle it

Observing that adding PAR to TD3+BC results in lower average returns or training instability on D4RL benchmark tasks compared to the baseline without PAR.

Original abstract

Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance, and approaches state-of-the-art results simply by being combined with the basic TD3+BC.
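
For readers unfamiliar with the base algorithm, the BC-regularized actor objective of TD3+BC can be sketched as follows. This is a plain-Python illustration with scalar actions, following the published TD3+BC formulation in which the value term is scaled by alpha divided by the mean absolute Q-value of the batch (alpha = 2.5 is that method's default). The function name and list-based interface are assumptions made for this sketch; a real implementation would use tensors and automatic differentiation.

```python
def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Sketch of the TD3+BC actor loss for a batch of scalar actions.

    q_values[i]        : Q(s_i, pi(s_i)), the critic's value of the policy action
    policy_actions[i]  : pi(s_i)
    dataset_actions[i] : the action stored in the dataset for s_i
    """
    n = len(q_values)
    # Normalizer that keeps the value term on a scale comparable to the BC term.
    # Assumes the batch has a non-zero mean absolute Q-value.
    mean_abs_q = sum(abs(q) for q in q_values) / n
    lam = alpha / mean_abs_q
    # Value term: the actor wants high Q.
    q_term = sum(q_values) / n
    # Behavior-cloning term: mean squared distance to the dataset actions.
    bc_term = sum((p - a) ** 2 for p, a in zip(policy_actions, dataset_actions)) / n
    return -lam * q_term + bc_term
```

The BC term pulls every policy action toward its dataset action regardless of that action's quality, which is the indiscriminate imitation the abstract refers to.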

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PAR gives a simple replacement trick to push past the BC imitation ceiling in offline actor-critic, but the uncertainty bound that keeps it stable looks shaky against extrapolation error.

The main takeaway is that this paper names a real limit in BC-regularized actor-critic methods: once imitation dominates, the actor cannot fully use value-function improvements when the dataset actions are suboptimal. It offers a plug-in fix called proximal action replacement (PAR). PAR swaps dataset actions for ones generated by the target policy along the local Q-gradient, clipped by a value-uncertainty bound. The authors back the diagnosis with a convergence argument and a low-dimensional bandit check, then show that adding PAR to plain TD3+BC lifts results across standard offline benchmarks and gets close to current leaders without extra machinery. That compatibility and the lightweight nature are the practical wins; it is easy to drop in on top of existing codebases. The experiments appear consistent, which is worth something in this area.

The soft spot is exactly the one the stress-test flags: the method assumes the uncertainty bound will keep replaced actions inside the data support. In offline settings the critic sees only dataset transitions, so standard uncertainty estimators often underestimate error away from the data, especially once the policy has drifted. If the bound is loose, you risk reintroducing the out-of-distribution bias the whole approach is meant to avoid, and the bandit test does not catch this because the space is small and fully covered. The abstract mentions formal analysis, but without the equations or the precise uncertainty estimator it is hard to judge how tight the guarantee actually is. No error bars or detailed ablations are visible in the summary either.

This is for researchers already working on BC-style regularization in offline RL who want a low-cost way to raise the performance floor. It has enough new mechanics and empirical signal to justify sending it to referees, even if the stability claim needs tighter verification in revision.
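
The concern about uncertainty estimators can be made concrete with a small sketch. It assumes the uncertainty is measured as the disagreement (population standard deviation) among an ensemble of critics, which is a common choice but not one the summary attributes to the paper. The function name `ensemble_sigma` and the critic callables are hypothetical.

```python
import statistics


def ensemble_sigma(critics, s, a):
    """Hypothetical uncertainty proxy: spread of an ensemble of critics at (s, a).

    critics : list of callables (s, a) -> estimated value
    """
    values = [q(s, a) for q in critics]
    # Population standard deviation across ensemble members.
    return statistics.pstdev(values)


# Two critics that disagree report non-zero uncertainty ...
disagreeing = [lambda s, a: 1.0, lambda s, a: 3.0]
# ... but two critics that share the same extrapolation report zero,
# no matter how far the action is from anything in the dataset.
agreeing = [lambda s, a: 5.0, lambda s, a: 5.0]

sigma_near = ensemble_sigma(disagreeing, s=0.0, a=0.0)
sigma_far = ensemble_sigma(agreeing, s=0.0, a=100.0)
```

Agreement among members is not the same as accuracy: if every critic extrapolates in the same wrong way, the proxy reports zero uncertainty and a bound built on it would not restrain the replacement.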

Referee Report

3 major / 2 minor

Summary. The manuscript claims that BC-regularized actor-critic methods in offline RL suffer from a performance ceiling when dataset actions are suboptimal, as analyzed through convergence properties and verified on a continuous bandit task. To overcome this, it introduces Proximal Action Replacement (PAR), which replaces dataset actions with improved actions from a stable target policy using local gradient ascent on the Q-function, bounded by value uncertainty. PAR is shown to be compatible with BC paradigms and, when combined with TD3+BC, consistently improves performance on offline RL benchmarks, approaching state-of-the-art results.

Significance. If the central claims hold, PAR represents a straightforward enhancement to existing offline RL methods that could allow better utilization of value function information without destabilizing training. The formal analysis and empirical gains on benchmarks would contribute to understanding and improving BC-based approaches in offline settings.

major comments (3)

[Convergence Analysis] Convergence Analysis section: The formal investigation of convergence properties of BC-regularized actor-critic optimization is load-bearing for identifying the performance ceiling, but the abstract provides no equations or key steps, making it difficult to assess how PAR specifically addresses the limitation without reducing to prior quantities.
[PAR Description] PAR Description section: The claim that bounding by value uncertainty σ(s,a) ensures training stability and prevents OOD actions is central, yet in offline RL the critic's extrapolation error can cause uncertainty estimators to underestimate far from data support, potentially allowing destabilizing replacements as training progresses.
[Experimental Verification] Experimental Verification section: The controlled bandit experiment is cited to verify the limitation, but without reported error bars, specific exclusion criteria, or tests in higher-dimensional spaces, it does not sufficiently address whether the uncertainty bound holds in realistic offline RL scenarios.

minor comments (2)

[Abstract] Abstract: The phrase 'approaches state-of-the-art results' would benefit from specifying the particular SOTA methods and metrics used for comparison.
[Throughout] Throughout the manuscript: Ensure all invented entities like 'Proximal Action Replacement (PAR)' are clearly defined upon first use with consistent notation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our convergence analysis and the practical considerations for PAR. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [Convergence Analysis] Convergence Analysis section: The formal investigation of convergence properties of BC-regularized actor-critic optimization is load-bearing for identifying the performance ceiling, but the abstract provides no equations or key steps, making it difficult to assess how PAR specifically addresses the limitation without reducing to prior quantities.

Authors: We agree that the abstract would benefit from including the key convergence result to make the performance ceiling explicit. In the revised version we will add the main equation from Section 3 (the fixed-point relation showing that the BC-regularized actor converges to a convex combination of dataset actions) and a one-sentence statement of how PAR breaks this fixed point by replacing actions with target-policy improvements.
revision: yes

Referee: [PAR Description] PAR Description section: The claim that bounding by value uncertainty σ(s,a) ensures training stability and prevents OOD actions is central, yet in offline RL the critic's extrapolation error can cause uncertainty estimators to underestimate far from data support, potentially allowing destabilizing replacements as training progresses.

Authors: This is a legitimate concern about uncertainty estimation under extrapolation. Our current bound is applied only to a single local gradient step from the original dataset action, which keeps replacements proximal by construction. Nevertheless, we will add an explicit discussion of the limitations of uncertainty estimators in offline settings and will include an ablation on the sensitivity of performance to the uncertainty threshold in the revised manuscript.
revision: partial

Referee: [Experimental Verification] Experimental Verification section: The controlled bandit experiment is cited to verify the limitation, but without reported error bars, specific exclusion criteria, or tests in higher-dimensional spaces, it does not sufficiently address whether the uncertainty bound holds in realistic offline RL scenarios.

Authors: We will revise the bandit experiment section to report mean and standard deviation over 10 random seeds, state the exact exclusion criteria used to generate suboptimal actions, and add a short paragraph explaining why the low-dimensional continuous bandit isolates the convergence issue without confounding factors. The main D4RL benchmark results already provide higher-dimensional validation of PAR under realistic offline data.
revision: yes
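
The fixed-point argument mentioned in the first response can be illustrated with a toy calculation. This is not the paper's equation from Section 3, which is not reproduced on this page. It assumes a one-dimensional bandit with a single dataset action $a_D$, a quadratic critic peaked at $a^*$, a value weight $\lambda > 0$, and a unit-weight squared BC penalty; all of these symbols are introduced here for illustration.

```latex
% Toy BC-regularized objective (assumed form, for illustration only)
J(a) = \lambda\, Q(a) - (a - a_D)^2, \qquad Q(a) = -(a - a^*)^2

% Stationarity condition
\frac{dJ}{da} = -2\lambda (a - a^*) - 2 (a - a_D) = 0
\;\Longrightarrow\;
\hat{a} = \frac{a_D + \lambda\, a^*}{1 + \lambda}

% Value gap at the fixed point
Q(\hat{a}) = -\left( \frac{a_D - a^*}{1 + \lambda} \right)^{2} < 0 = Q(a^*)
\quad \text{whenever } a_D \neq a^*

% Replacing a_D with a closer action a' shrinks the gap
\hat{a}' = \frac{a' + \lambda\, a^*}{1 + \lambda}, \qquad
\frac{|\hat{a}' - a^*|}{|\hat{a} - a^*|} = \frac{|a' - a^*|}{|a_D - a^*|}
```

In this toy setting the actor settles strictly between the dataset action and the optimum for any finite $\lambda$, so a suboptimal $a_D$ caps the achievable value. Swapping $a_D$ for an action closer to $a^*$ moves the fixed point proportionally closer, which is the intuition behind action replacement.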

Circularity Check

0 steps flagged

No circularity: PAR is an independent plug-in method with self-contained analysis

The paper derives the limitation of BC-regularized actor-critic via convergence analysis on a bandit task, then introduces PAR as a replacement rule using target-policy ascent clipped by uncertainty. No equation reduces the claimed improvement to a fitted parameter or prior self-citation by construction; the uncertainty bound and stability claims are stated as assumptions rather than derived tautologies. Experiments treat PAR as an additive module on TD3+BC, with results reported as empirical gains rather than forced identities. This is the normal non-circular case for a methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract supplies no explicit free parameters, axioms, or invented entities beyond the standard offline RL setup; PAR itself is a new algorithmic component whose stability relies on unstated details of the target policy and uncertainty bound.

axioms (1)

domain assumption: Standard offline RL assumptions that the static dataset provides sufficient coverage and that value estimates remain reliable enough to guide safe action replacement.
Implicit in any BC-regularized offline actor-critic method.

invented entities (1)

Proximal Action Replacement (PAR) · no independent evidence
purpose: Training sample replacer that substitutes suboptimal dataset actions with value-guided improvements.
New algorithmic construct introduced to address the identified ceiling; no independent evidence outside the paper's experiments is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Theorem 3.1 (Sub-optimality of BC Regularization) ... $(\hat{\pi}, \hat{Q}) \neq (\pi^*, Q^*)$ and $Q^*(s, \hat{\pi}(s)) < Q^*(s, \pi^*(s))$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Theorem 3.2 (Instability via Policy Divergence) ... $\min L(Q) \geq \mu \cdot \mathbb{E}\left[ \| \pi_\theta(s') - \pi_\beta(s') \|^2 \right]$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Refining Compositional Diffusion for Reliable Long-Horizon Planning
cs.RO · 2026-05 · unverdicted · novelty 6.0

RCD steers compositional diffusion sampling toward high-density coherent plans by combining reconstruction-error guidance with overlap consistency, outperforming prior methods on locomotion, manipulation, and pixel-ba...