Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning
Pith reviewed 2026-05-16 06:35 UTC · model grok-4.3
The pith
Proximal action replacement overcomes the imitation ceiling in BC-regularized actor-critic by substituting suboptimal dataset actions with value-guided improvements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that indiscriminate behavior cloning imposes a structural limit on actor-critic learning in offline settings: when dataset actions are suboptimal, the actor cannot fully exploit the better actions suggested by the value function. Replacing those actions with proximal actions from the target policy, guided by Q-function ascent and bounded by value uncertainty, lets the optimization escape this ceiling while preserving training stability.
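The ceiling can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's bandit task or its theorem. It assumes a single dataset action `a_data`, a known quadratic value `Q(a) = -(a - a_star)^2`, and a TD3+BC-style objective `lam * Q(a) - (a - a_data)^2`. All names and constants are hypothetical.

```python
def bc_regularized_optimum(a_data, a_star, lam, lr=0.05, steps=2000):
    """Gradient ascent on J(a) = lam * Q(a) - (a - a_data)**2,
    with the assumed toy value Q(a) = -(a - a_star)**2."""
    a = a_data
    for _ in range(steps):
        # dJ/da: the value term pulls toward a_star, the BC term pulls toward a_data
        grad = -2.0 * lam * (a - a_star) - 2.0 * (a - a_data)
        a += lr * grad
    return a


a_data, a_star = 0.0, 1.0  # suboptimal dataset action and the true optimum
for lam in (0.5, 1.0, 4.0):
    a_hat = bc_regularized_optimum(a_data, a_star, lam)
    # Setting dJ/da = 0 gives a convex combination of a_star and a_data
    closed_form = (lam * a_star + a_data) / (lam + 1.0)
    print(f"lam={lam}: numeric={a_hat:.4f}, closed form={closed_form:.4f}")
```

Under these assumptions the learned action always lies strictly between `a_data` and `a_star` for any finite `lam`, so it never reaches the optimum. Moving `a_data` toward `a_star` is the only way to close the gap without dropping the imitation term, and action replacement targets exactly that.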
What carries the argument
Proximal Action Replacement (PAR), which substitutes the dataset action in each training sample with an improved action sampled from the target policy, obtained by locally ascending the action-value function and bounded by value uncertainty to avoid instability.
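A minimal sketch of one possible replacement rule follows. It is an illustration under stated assumptions, not the paper's algorithm. It takes the single-gradient-step reading from the authors' rebuttal and starts from the dataset action, so the target policy network is omitted. It uses a hypothetical toy critic ensemble, and it assumes the uncertainty bound is an acceptance test `gain > kappa * sigma`. The names `ensemble_q`, `proximal_replace`, `step`, and `kappa` are invented for this example.

```python
import numpy as np


def ensemble_q(actions, centers):
    """Hypothetical toy critic ensemble: member i scores Q_i(a) = -(a - c_i)**2.
    Returns an array of shape (n_members, n_actions)."""
    return -(actions[None, :] - centers[:, None]) ** 2


def ensemble_q_grad(actions, centers):
    """Gradient of the ensemble-mean Q with respect to each action."""
    return (-2.0 * (actions[None, :] - centers[:, None])).mean(axis=0)


def proximal_replace(a_data, centers, step=0.1, kappa=1.0):
    """Assumed rule: take one local ascent step from the dataset action and keep
    the candidate only if the mean Q gain exceeds kappa times the ensemble
    standard deviation at the candidate."""
    candidate = a_data + step * ensemble_q_grad(a_data, centers)
    q_old = ensemble_q(a_data, centers).mean(axis=0)
    q_new = ensemble_q(candidate, centers)
    gain = q_new.mean(axis=0) - q_old
    sigma = q_new.std(axis=0)  # ensemble disagreement as a stand-in for value uncertainty
    accept = gain > kappa * sigma
    return np.where(accept, candidate, a_data), accept


centers = np.array([0.9, 1.0, 1.1])   # ensemble members disagree about the optimum
a_data = np.array([0.0, 0.95, 3.0])   # two clearly suboptimal actions, one near-optimal
a_train, accepted = proximal_replace(a_data, centers)
print(a_train, accepted)
```

In this toy setup the two clearly suboptimal actions are replaced. The near-optimal action at 0.95 is kept, because its predicted gain is smaller than the ensemble disagreement. The BC term of the actor loss would then imitate `a_train` instead of `a_data`.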
If this is right
- PAR boosts performance when added to TD3+BC across multiple offline RL benchmarks.
- PAR is compatible with different BC regularization approaches.
- PAR allows the actor to exploit better actions suggested by the value function without destabilizing training.
- The method approaches state-of-the-art results using only basic TD3+BC as the base algorithm.
Where Pith is reading between the lines
- PAR might generalize to other offline RL algorithms beyond actor-critic methods.
- Implementing PAR could simplify the design of offline RL systems by reducing reliance on sophisticated regularization techniques.
- Further tests in high-dimensional or discrete action spaces could reveal additional benefits or limitations of the replacement strategy.
Load-bearing premise
The target policy remains stable enough that its generated actions can replace dataset actions without causing training divergence or introducing harmful bias.
What would settle it
Observing that adding PAR to TD3+BC yields lower average returns than the baseline without PAR, or causes training instability, on D4RL benchmark tasks.
Original abstract
Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance, and approaches state-of-the-art results simply by being combined with the basic TD3+BC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that BC-regularized actor-critic methods in offline RL suffer from a performance ceiling when dataset actions are suboptimal, as analyzed through convergence properties and verified on a continuous bandit task. To overcome this, it introduces Proximal Action Replacement (PAR), which replaces dataset actions with improved actions from a stable target policy using local gradient ascent on the Q-function, bounded by value uncertainty. PAR is shown to be compatible with BC paradigms and, when combined with TD3+BC, consistently improves performance on offline RL benchmarks, approaching state-of-the-art results.
Significance. If the central claims hold, PAR represents a straightforward enhancement to existing offline RL methods that could allow better utilization of value function information without destabilizing training. The formal analysis and empirical gains on benchmarks would contribute to understanding and improving BC-based approaches in offline settings.
major comments (3)
- [Convergence Analysis] Convergence Analysis section: The formal investigation of convergence properties of BC-regularized actor-critic optimization is load-bearing for identifying the performance ceiling, but the abstract provides no equations or key steps, making it difficult to assess how PAR specifically addresses the limitation without reducing to prior quantities.
- [PAR Description] PAR Description section: The claim that bounding by value uncertainty σ(s,a) ensures training stability and prevents OOD actions is central, yet in offline RL the critic's extrapolation error can cause uncertainty estimators to underestimate far from data support, potentially allowing destabilizing replacements as training progresses.
- [Experimental Verification] Experimental Verification section: The controlled bandit experiment is cited to verify the limitation, but without reported error bars, specific exclusion criteria, or tests in higher-dimensional spaces, it does not sufficiently address whether the uncertainty bound holds in realistic offline RL scenarios.
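The concern in the second major comment can be made concrete with a toy estimator. The sketch below is not the paper's uncertainty estimator. It assumes a bootstrap ensemble of ridge regressors on Gaussian RBF features, fitted to noisy values of a hypothetical true function `true_q` on a narrow action range. Every name and constant is illustrative.

```python
import numpy as np


def rbf_features(a, centers, width=0.3):
    # Gaussian bumps: every feature decays to zero far from the centers
    return np.exp(-((a[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))


def fit_ensemble(a_data, q_data, centers, n_members=5, ridge=1e-3, seed=0):
    """Fit each member by ridge regression on a bootstrap resample of the data."""
    rng = np.random.default_rng(seed)
    weights = []
    for _ in range(n_members):
        idx = rng.integers(0, len(a_data), size=len(a_data))
        phi = rbf_features(a_data[idx], centers)
        gram = phi.T @ phi + ridge * np.eye(len(centers))
        weights.append(np.linalg.solve(gram, phi.T @ q_data[idx]))
    return np.stack(weights)


def ensemble_mean_std(a, weights, centers):
    preds = rbf_features(a, centers) @ weights.T  # shape (n_points, n_members)
    return preds.mean(axis=1), preds.std(axis=1)


def true_q(a):
    return -(a - 2.0) ** 2  # hypothetical ground truth, optimum outside the data


rng = np.random.default_rng(1)
a_data = rng.uniform(-0.5, 0.5, size=200)           # narrow data support
q_data = true_q(a_data) + 0.1 * rng.normal(size=200)
centers = np.linspace(-0.5, 0.5, 8)
weights = fit_ensemble(a_data, q_data, centers)

a_query = np.array([0.0, 0.8, 5.0])                 # in support, near edge, far away
mean, std = ensemble_mean_std(a_query, weights, centers)
err = np.abs(mean - true_q(a_query))
for a, s, e in zip(a_query, std, err):
    print(f"a={a}: ensemble std={s:.3e}, true abs error={e:.3f}")
```

With this estimator, the disagreement at the far query point collapses toward zero because all features vanish there, while the true error is large. An acceptance test based only on such a `std` would not flag that action as uncertain. This is why the locality of the replacement step matters in addition to the uncertainty bound.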
minor comments (2)
- [Abstract] Abstract: The phrase 'approaches state-of-the-art results' would benefit from specifying the particular SOTA methods and metrics used for comparison.
- [Throughout] Throughout the manuscript: Ensure all invented entities like 'Proximal Action Replacement (PAR)' are clearly defined upon first use with consistent notation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our convergence analysis and the practical considerations for PAR. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Convergence Analysis] Convergence Analysis section: The formal investigation of convergence properties of BC-regularized actor-critic optimization is load-bearing for identifying the performance ceiling, but the abstract provides no equations or key steps, making it difficult to assess how PAR specifically addresses the limitation without reducing to prior quantities.
  Authors: We agree that the abstract would benefit from including the key convergence result to make the performance ceiling explicit. In the revised version we will add the main equation from Section 3 (the fixed-point relation showing that the BC-regularized actor converges to a convex combination of dataset actions) and a one-sentence statement of how PAR breaks this fixed point by replacing actions with target-policy improvements.
  Revision: yes
- Referee: [PAR Description] PAR Description section: The claim that bounding by value uncertainty σ(s,a) ensures training stability and prevents OOD actions is central, yet in offline RL the critic's extrapolation error can cause uncertainty estimators to underestimate far from data support, potentially allowing destabilizing replacements as training progresses.
  Authors: This is a legitimate concern about uncertainty estimation under extrapolation. Our current bound is applied only to a single local gradient step from the original dataset action, which keeps replacements proximal by construction. Nevertheless, we will add an explicit discussion of the limitations of uncertainty estimators in offline settings and will include an ablation on the sensitivity of performance to the uncertainty threshold in the revised manuscript.
  Revision: partial
- Referee: [Experimental Verification] Experimental Verification section: The controlled bandit experiment is cited to verify the limitation, but without reported error bars, specific exclusion criteria, or tests in higher-dimensional spaces, it does not sufficiently address whether the uncertainty bound holds in realistic offline RL scenarios.
  Authors: We will revise the bandit experiment section to report mean and standard deviation over 10 random seeds, state the exact exclusion criteria used to generate suboptimal actions, and add a short paragraph explaining why the low-dimensional continuous bandit isolates the convergence issue without confounding factors. The main D4RL benchmark results already provide higher-dimensional validation of PAR under realistic offline data.
  Revision: yes
Circularity Check
No circularity: PAR is an independent plug-in method with self-contained analysis
Full rationale
The paper derives the limitation of BC-regularized actor-critic via convergence analysis, verifies it on a bandit task, and then introduces PAR as a replacement rule using target-policy ascent clipped by uncertainty. No equation reduces the claimed improvement to a fitted parameter or prior self-citation by construction; the uncertainty bound and stability claims are stated as assumptions rather than derived tautologies. Experiments treat PAR as an additive module on TD3+BC, with results reported as empirical gains rather than forced identities. This is the normal non-circular case for a methods paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard offline RL assumptions that the static dataset provides sufficient coverage and that value estimates remain reliable enough to guide safe action replacement.
invented entities (1)
- Proximal Action Replacement (PAR): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: relation between the paper passage and the cited Recognition theorem.
  Theorem 3.1 (Sub-optimality of BC Regularization) ... $(\hat{\pi}, \hat{Q}) \neq (\pi^*, Q^*)$ and $Q^*(s, \hat{\pi}(s)) < Q^*(s, \pi^*(s))$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
  unclear: relation between the paper passage and the cited Recognition theorem.
  Theorem 3.2 (Instability via Policy Divergence) ... $\min L(Q) \geq \mu \cdot \mathbb{E}\left[\|\pi_\theta(s') - \pi_\beta(s')\|^2\right]$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Refining Compositional Diffusion for Reliable Long-Horizon Planning
  RCD steers compositional diffusion sampling toward high-density coherent plans by combining reconstruction-error guidance with overlap consistency, outperforming prior methods on locomotion, manipulation, and pixel-ba...