Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning
Pith reviewed 2026-05-16 23:06 UTC · model grok-4.3
The pith
Adaptive Replay Buffer dynamically prioritizes on-policy online data to improve offline-to-online RL
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Adaptive Replay Buffer (ARB) is a learning-free mechanism that computes a lightweight on-policyness score for each collected trajectory, measuring how closely its behavior matches the current policy, and then assigns proportional sampling weights to every transition inside that trajectory. By doing so, the buffer maintains early stability from offline data while progressively focusing learning on the most relevant online experiences, producing both lower early degradation and higher asymptotic performance when added to standard offline-to-online RL algorithms on D4RL tasks.
What carries the argument
The on-policyness metric, a lightweight score that quantifies trajectory alignment with the current policy and sets proportional sampling weights inside the Adaptive Replay Buffer.
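The weighting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: this summary does not give the paper's exact on-policyness formula, so the sketch assumes a hypothetical score (the mean log-probability of a trajectory's logged actions under a fixed-variance Gaussian policy with 1-D actions, min-max normalized across trajectories). The names `on_policyness`, `transition_weights`, `sample_batch`, and `policy_mean` are invented for the example.

```python
import math
import random


def on_policyness(trajectory, policy_mean, sigma=1.0):
    """ASSUMED score: mean Gaussian log-prob of logged actions under the current policy.

    trajectory is a list of (state, action) pairs with scalar actions;
    policy_mean maps a state to the policy's mean action (hypothetical interface).
    """
    log_norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for state, action in trajectory:
        z = (action - policy_mean(state)) / sigma
        total += -0.5 * z * z - log_norm
    return total / len(trajectory)


def transition_weights(trajectories, policy_mean, eps=1e-3):
    """Give every transition the (normalized) weight of its parent trajectory."""
    scores = [on_policyness(t, policy_mean) for t in trajectories]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # Min-max normalization is an assumption; eps keeps every trajectory sampleable.
    norm = [((s - lo) / span if span > 0 else 1.0) + eps for s in scores]

    flat, weights = [], []
    for traj, w in zip(trajectories, norm):
        for transition in traj:
            flat.append(transition)
            weights.append(w)
    total = sum(weights)
    return flat, [w / total for w in weights]


def sample_batch(flat, probs, batch_size, rng=random):
    """Sample transitions with replacement, proportionally to their weights."""
    return rng.choices(flat, weights=probs, k=batch_size)
```

Because the score is recomputed against the current policy, the weights would need to be refreshed as the policy changes; how often the paper does this is not stated here.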
If this is right
- ARB integrates into existing O2O RL algorithms without complex additional learning or fixed-ratio tuning.
- The method mitigates early performance degradation during the shift from offline to online data.
- Final asymptotic performance rises across multiple O2O algorithms on D4RL benchmarks.
- The approach stays simple and learning-free, adding negligible computational cost.
Where Pith is reading between the lines
- Behavior-aware sampling may replace fixed mixing ratios in other RL settings where data relevance changes over time.
- Extending the on-policyness score with reward or uncertainty signals could further sharpen data selection.
- Real-robotics tests would check whether the metric remains effective outside simulation assumptions.
- The trajectory-level weighting invites combinations with other prioritization schemes already used in replay buffers.
Load-bearing premise
The on-policyness metric accurately identifies useful data for weighting, without introducing bias and without relying on domain-specific tuning that would itself account for the claimed gains.
What would settle it
An experiment in which ARB-augmented algorithms produce equal or lower final performance than fixed-ratio baselines on multiple D4RL tasks would show the adaptive weighting does not deliver the reported gains.
Original abstract
Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design. Our code is publicly available at https://github.com/song970407/ARB.
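For contrast, the fixed data-mixing ratio that the abstract argues against can be sketched in a few lines. This is an illustrative baseline only; the function name `sample_fixed_ratio` and the `online_ratio` parameter are hypothetical and do not come from the paper or its repository.

```python
import random


def sample_fixed_ratio(offline, online, batch_size, online_ratio=0.5, rng=random):
    """Draw a batch in which a constant fraction comes from the online buffer.

    offline and online are lists of transitions. If no online data has been
    collected yet, the whole batch falls back to offline data.
    """
    n_online = round(batch_size * online_ratio) if online else 0
    n_offline = batch_size - n_online
    batch = rng.choices(online, k=n_online) if n_online else []
    batch += rng.choices(offline, k=n_offline)
    return batch
```

The single constant `online_ratio` is the quantity that has to be tuned in this scheme. An adaptive buffer of the kind the abstract describes would replace it with per-transition weights that shift as the policy changes.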
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Adaptive Replay Buffer (ARB) for offline-to-online RL, which computes a lightweight 'on-policyness' metric to assign proportional sampling weights to transitions based on alignment with the current policy. This is claimed to replace fixed mixing ratios, mitigate early performance degradation, and improve final performance when integrated into existing O2O algorithms, with supporting experiments on D4RL benchmarks.
Significance. If the on-policyness weighting can be shown to deliver the claimed gains without hidden bias or environment-specific tuning, the method would offer a simple, learning-free improvement to O2O RL pipelines that could be adopted broadly.
major comments (3)
- [Abstract and §3] Abstract and §3 (method): the on-policyness metric is described only at a high level as 'trajectory alignment with current policy' and 'lightweight, learning-free computation'; no closed-form definition, normalization procedure, or pseudocode is supplied, preventing verification that the weighting is bias-free or that it trades off stability versus asymptotic performance as asserted.
- [§4] §4 (experiments): the headline claim of 'consistent gains' and 'significantly improves final performance' on D4RL is stated without baseline implementation details, statistical significance tests, variance across seeds, or ablations that replace the metric with uniform or reward-based sampling; this leaves the causal contribution of ARB untested.
- [§4 and Table 1] §4 and Table 1: no sensitivity analysis or domain-specific tuning results are reported for the on-policyness threshold or weighting function, contradicting the claim that ARB is 'simple to implement' and 'seamlessly integrating' without additional hyperparameters.
minor comments (1)
- [Abstract] The GitHub link is provided but the manuscript does not specify which exact D4RL tasks, algorithms (e.g., CQL, TD3+BC), and hyper-parameters were used, making direct reproduction difficult.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us identify areas for improving clarity and experimental rigor. We address each major comment point-by-point below and indicate the revisions planned for the next manuscript version.
- Referee: [Abstract and §3] Abstract and §3 (method): the on-policyness metric is described only at a high level as 'trajectory alignment with current policy' and 'lightweight, learning-free computation'; no closed-form definition, normalization procedure, or pseudocode is supplied, preventing verification that the weighting is bias-free or that it trades off stability versus asymptotic performance as asserted.
  Authors: We agree that the description in §3 would benefit from greater formality. The manuscript currently presents the metric conceptually as the alignment of trajectories with the current policy via action probabilities. In the revised version we will add the exact closed-form expression (average log-probability ratio under the current vs. behavior policy, normalized to [0,1]), the weighting formula, and pseudocode for the sampling step. This will make the bias-free property and stability-performance trade-off explicit and verifiable.
  Revision: yes
- Referee: [§4] §4 (experiments): the headline claim of 'consistent gains' and 'significantly improves final performance' on D4RL is stated without baseline implementation details, statistical significance tests, variance across seeds, or ablations that replace the metric with uniform or reward-based sampling; this leaves the causal contribution of ARB untested.
  Authors: We acknowledge that the experimental section requires additional rigor to support the claims. The revised manuscript will include: full baseline implementation details and code references, mean and standard deviation over five random seeds, statistical significance tests (paired t-tests), and ablations that substitute the on-policyness metric with uniform sampling and reward-based weighting. These additions will isolate the causal contribution of ARB.
  Revision: yes
- Referee: [§4 and Table 1] §4 and Table 1: no sensitivity analysis or domain-specific tuning results are reported for the on-policyness threshold or weighting function, contradicting the claim that ARB is 'simple to implement' and 'seamlessly integrating' without additional hyperparameters.
  Authors: ARB contains no explicit threshold or tunable weighting function; the sampling weight is strictly proportional to the computed on-policyness score. Nevertheless, we agree that empirical robustness should be demonstrated. The revision will add a sensitivity analysis (appendix) showing performance under small perturbations of any scaling constants and across all D4RL domains, confirming that no environment-specific tuning is required.
  Revision: partial
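The reporting promised in the second response (mean and standard deviation over seeds, plus paired t-tests) could look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' evaluation code: it assumes one final score per seed and that ARB and baseline runs are paired by seed, and the function names `summarize` and `paired_t_statistic` are hypothetical. It uses only the Python standard library and returns the t statistic and degrees of freedom, not a p-value.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed final scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(with_arb, baseline):
    """Paired t statistic for per-seed score differences (ARB minus baseline).

    Assumes with_arb[i] and baseline[i] were produced with the same seed.
    Returns (t, degrees_of_freedom).
    """
    if len(with_arb) != len(baseline) or len(with_arb) < 2:
        raise ValueError("need two equally long lists with at least 2 seeds")
    diffs = [a - b for a, b in zip(with_arb, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With SciPy installed, `scipy.stats.ttest_rel` computes the same statistic together with a p-value. With only five seeds the test has little power, so reporting the per-seed differences alongside it would be informative.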
Circularity Check
No circularity in ARB derivation; on-policyness metric defined independently
The paper defines the Adaptive Replay Buffer via a new on-policyness metric that directly measures trajectory alignment with the current policy and assigns proportional weights. This definition is presented as a lightweight, learning-free computation without equations that reduce the metric or claimed gains back to fitted parameters, self-referential loops, or prior self-citations. Experimental results on D4RL are empirical outcomes rather than derivations that equate outputs to inputs by construction. No load-bearing self-citation chains, uniqueness theorems, or ansatzes are invoked for the core mechanism.
Axiom & Free-Parameter Ledger
invented entities (1)
- on-policyness metric: no independent evidence
Forward citations
Cited by 1 Pith paper
- ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization. ROAD formulates data mixing as a bi-level optimization problem, solved with a multi-armed bandit, to adaptively balance offline priors and online updates in RL.