arxiv: 2512.10510 · v2 · submitted 2025-12-11 · 💻 cs.LG · cs.AI

Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning

Chihyeon Song, Jaewoo Lee, Jinkyoo Park

Pith reviewed 2026-05-16 23:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords offline-to-online reinforcement learning · adaptive replay buffer · on-policyness metric · D4RL benchmarks · data sampling weights · policy alignment · O2O RL algorithms


The pith

Adaptive Replay Buffer dynamically prioritizes on-policy online data to improve offline-to-online RL

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline-to-online reinforcement learning must balance a fixed offline dataset with newly collected online experiences, yet fixed mixing ratios often cause early performance drops or cap final gains. The paper introduces the Adaptive Replay Buffer as a lightweight add-on that scores each trajectory by its alignment with the current policy using a simple on-policyness metric and then samples transitions with weights proportional to that score. This design lets offline data support initial stability while the method gradually shifts emphasis to the most relevant high-reward online experiences. Experiments on D4RL benchmarks show that inserting ARB into several existing O2O algorithms reduces early degradation and raises final performance without extra learning steps or complex tuning.
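For contrast, the fixed-ratio mixing that the paper argues against can be sketched as follows. This is a minimal illustration, not code from the paper or its repository: the function name `sample_fixed_ratio`, the `online_ratio` parameter, and the list-based buffers are all assumptions made for the example.

```python
import random

def sample_fixed_ratio(offline, online, batch_size, online_ratio=0.5, rng=random):
    """Draw a minibatch with a constant share of online transitions.

    Hypothetical baseline for illustration: `online_ratio` never changes
    during training, regardless of how relevant each transition is.
    `offline` must be non-empty; `online` may be empty early on.
    """
    # With no online data yet, fall back to an all-offline batch.
    n_online = round(batch_size * online_ratio) if online else 0
    n_offline = batch_size - n_online
    batch = [rng.choice(online) for _ in range(n_online)]
    batch += [rng.choice(offline) for _ in range(n_offline)]
    return batch
```

The single constant `online_ratio` is the knob that, per the summary above, tends to cause either early performance drops or a cap on final gains.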

Core claim

The Adaptive Replay Buffer (ARB) is a learning-free mechanism that computes a lightweight on-policyness score for each collected trajectory, measuring how closely its behavior matches the current policy, and then assigns proportional sampling weights to every transition inside that trajectory. By doing so, the buffer maintains early stability from offline data while progressively focusing learning on the most relevant online experiences, producing both lower early degradation and higher asymptotic performance when added to standard offline-to-online RL algorithms on D4RL tasks.
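The trajectory-level weighting described here might look roughly like the sketch below. The exact on-policyness formula is not given on this page, so `on_policyness` uses a stand-in (the exponential of the negative mean squared distance between stored and current-policy actions) that is an assumption, not the paper's definition. The names `transition_weights` and `sample_batch`, and the representation of a trajectory as a list of `(state, action)` tuples, are also hypothetical.

```python
import math
import random

def on_policyness(trajectory, policy):
    """Stand-in score in (0, 1]; NOT the paper's definition.

    Assumption: alignment is exp(-mean squared distance) between the
    stored action and the action the current policy would take.
    """
    sq = [sum((a - b) ** 2 for a, b in zip(action, policy(state)))
          for state, action in trajectory]
    return math.exp(-sum(sq) / len(sq))

def transition_weights(trajectories, policy):
    """Give every transition the score of the trajectory it belongs to."""
    transitions, weights = [], []
    for traj in trajectories:
        score = on_policyness(traj, policy)
        for step in traj:
            transitions.append(step)
            weights.append(score)
    return transitions, weights

def sample_batch(trajectories, policy, batch_size, rng=random):
    """Sample transitions with probability proportional to their weight."""
    transitions, weights = transition_weights(trajectories, policy)
    return rng.choices(transitions, weights=weights, k=batch_size)
```

Because the score depends on the current policy, an implementation along these lines would need to refresh the weights as the policy changes; how often the paper does so is not stated here.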

What carries the argument

The on-policyness metric, a lightweight score that quantifies trajectory alignment with the current policy and sets proportional sampling weights inside the Adaptive Replay Buffer.
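One way to write the proportional rule down, as a sketch only: the symbols $s(\tau)$, $\tau(i)$, and $\mathcal{B}$ are notation introduced here for illustration, and the paper's exact score is not reproduced on this page.

```latex
% s(\tau) \ge 0 : on-policyness score of trajectory \tau under the current policy (assumed notation)
% \tau(i)       : the trajectory that contains transition i
% \mathcal{B}   : all transitions in the buffer, offline and online
P(i) = \frac{s(\tau(i))}{\sum_{j \in \mathcal{B}} s(\tau(j))}

% Expected share of online transitions in a minibatch
\rho_{\text{online}} = \sum_{i \in \mathcal{B}_{\text{online}}} P(i)
  = \frac{\sum_{i \in \mathcal{B}_{\text{online}}} s(\tau(i))}{\sum_{j \in \mathcal{B}} s(\tau(j))}
```

Under this reading, the expected online share exceeds the online fraction of the buffer exactly when online transitions carry a higher average score than offline ones, so no mixing ratio has to be set by hand.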

If this is right

ARB integrates into existing O2O RL algorithms without complex additional learning or fixed-ratio tuning.
The method mitigates early performance degradation during the shift from offline to online data.
Final asymptotic performance rises across multiple O2O algorithms on D4RL benchmarks.
The approach stays simple and learning-free, adding negligible computational cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Behavior-aware sampling may replace fixed mixing ratios in other RL settings where data relevance changes over time.
Extending the on-policyness score with reward or uncertainty signals could further sharpen data selection.
Real-robotics tests would check whether the metric remains effective outside simulation assumptions.
The trajectory-level weighting invites combinations with other prioritization schemes already used in replay buffers.

Load-bearing premise

The on-policyness metric accurately identifies useful data for weighting without introducing bias or requiring domain-specific tuning that affects the claimed gains.

What would settle it

An experiment in which ARB-augmented algorithms produce equal or lower final performance than fixed-ratio baselines on multiple D4RL tasks would show the adaptive weighting does not deliver the reported gains.
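The check described above could be expressed as a small helper like the one below. It is a hypothetical sketch: `tasks_without_gain` and the dictionary inputs are names invented for this example, and the scores must come from real runs supplied by the caller.

```python
def tasks_without_gain(arb_scores, baseline_scores):
    """Return the tasks where the ARB variant does not beat the baseline.

    Both arguments map task name -> final normalized score (hypothetical
    inputs; the caller supplies real measurements). Only tasks present in
    both mappings are compared.
    """
    shared = sorted(arb_scores.keys() & baseline_scores.keys())
    return [task for task in shared if arb_scores[task] <= baseline_scores[task]]
```

A long list returned across many D4RL tasks would count against the claim. A single-run comparison like this ignores seed variance, so it is a first filter rather than a verdict.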

Figures

Figures reproduced from arXiv: 2512.10510 by Chihyeon Song, Jaewoo Lee, Jinkyoo Park.

**Figure 1.** Online data ratio of the minibatch over environment steps for different hopper datasets with FamO2O. Furthermore, a critical finding emerges when the offline dataset’s average reward is low. In these cases, ARB’s online data ratio curve rises sharply, a behavior not observed in other methods. This effect provides direct evidence of ARB’s adaptive prioritization mechanism. By performing on-the-fly priorit…

**Figure 2.** Online data ratio and normalized score pre…

**Figure 3.** Normalized scores and online data ratios

Original abstract

Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design. Our code is publicly available at https://github.com/song970407/ARB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARB introduces a lightweight on-policyness metric for adaptive replay in O2O RL, but missing metric definitions, ablations, and stats leave the D4RL gains hard to evaluate.


The main contribution is a learning-free adaptive replay buffer that weights transitions by how closely their trajectories match the current policy's behavior. This replaces fixed mixing ratios with dynamic prioritization, aiming to keep early stability from offline data while shifting focus to useful online samples as training progresses. The approach is simple enough to plug into existing O2O algorithms without extra networks, which is a practical step forward for a common pain point in the area. Releasing the code also helps anyone who wants to test the idea directly.

The high-level motivation holds up: standard methods do struggle with the stability-versus-asymptotics trade-off, and a behavior-aware buffer makes sense as a fix. The D4RL results are presented as consistent improvements across algorithms, which would be useful if they hold.

The soft spots are mostly around evidence. The on-policyness metric lacks a closed-form definition or normalization details in the available description, so it is difficult to judge whether it truly measures policy alignment or simply correlates with reward or length. No ablations appear that would isolate its contribution, such as swapping it for uniform or reward-based sampling. The experiments are called extensive but come without statistical tests, variance numbers, or clear baseline breakdowns, which weakens the central claim. The stress-test concern about potential bias in the weighting scheme is reasonable given the gaps.

This paper is aimed at RL researchers working on offline-to-online transfer, especially in control or robotics settings. A reader could extract the basic strategy and try the code, but extending or trusting the results would require the missing implementation pieces. It shows honest engagement with the problem and the literature, so it deserves peer review to check the full details and verify the numbers rather than a desk reject.

Referee Report

3 major / 1 minor

Summary. The paper introduces Adaptive Replay Buffer (ARB) for offline-to-online RL, which computes a lightweight 'on-policyness' metric to assign proportional sampling weights to transitions based on alignment with the current policy. This is claimed to replace fixed mixing ratios, mitigate early performance degradation, and improve final performance when integrated into existing O2O algorithms, with supporting experiments on D4RL benchmarks.

Significance. If the on-policyness weighting can be shown to deliver the claimed gains without hidden bias or environment-specific tuning, the method would offer a simple, learning-free improvement to O2O RL pipelines that could be adopted broadly.

major comments (3)

[Abstract and §3 (method)] The on-policyness metric is described only at a high level as 'trajectory alignment with current policy' and 'lightweight, learning-free computation'; no closed-form definition, normalization procedure, or pseudocode is supplied, preventing verification that the weighting is bias-free or that it trades off stability versus asymptotic performance as asserted.
[§4 (experiments)] The headline claim of 'consistent gains' and 'significantly improves final performance' on D4RL is stated without baseline implementation details, statistical significance tests, variance across seeds, or ablations that replace the metric with uniform or reward-based sampling; this leaves the causal contribution of ARB untested.
[§4 and Table 1] No sensitivity analysis or domain-specific tuning results are reported for the on-policyness threshold or weighting function, contradicting the claim that ARB is 'simple to implement' and 'seamlessly integrating' without additional hyperparameters.

minor comments (1)

[Abstract] The GitHub link is provided but the manuscript does not specify which exact D4RL tasks, algorithms (e.g., CQL, TD3+BC), and hyper-parameters were used, making direct reproduction difficult.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped us identify areas for improving clarity and experimental rigor. We address each major comment point-by-point below and indicate the revisions planned for the next manuscript version.


Referee: [Abstract and §3 (method)] The on-policyness metric is described only at a high level as 'trajectory alignment with current policy' and 'lightweight, learning-free computation'; no closed-form definition, normalization procedure, or pseudocode is supplied, preventing verification that the weighting is bias-free or that it trades off stability versus asymptotic performance as asserted.

Authors: We agree that the description in §3 would benefit from greater formality. The manuscript currently presents the metric conceptually as the alignment of trajectories with the current policy via action probabilities. In the revised version we will add the exact closed-form expression (average log-probability ratio under the current vs. behavior policy, normalized to [0,1]), the weighting formula, and pseudocode for the sampling step. This will make the bias-free property and stability-performance trade-off explicit and verifiable.
Revision: yes

Referee: [§4 (experiments)] The headline claim of 'consistent gains' and 'significantly improves final performance' on D4RL is stated without baseline implementation details, statistical significance tests, variance across seeds, or ablations that replace the metric with uniform or reward-based sampling; this leaves the causal contribution of ARB untested.

Authors: We acknowledge that the experimental section requires additional rigor to support the claims. The revised manuscript will include: full baseline implementation details and code references, mean and standard deviation over five random seeds, statistical significance tests (paired t-tests), and ablations that substitute the on-policyness metric with uniform sampling and reward-based weighting. These additions will isolate the causal contribution of ARB.
Revision: yes

Referee: [§4 and Table 1] No sensitivity analysis or domain-specific tuning results are reported for the on-policyness threshold or weighting function, contradicting the claim that ARB is 'simple to implement' and 'seamlessly integrating' without additional hyperparameters.

Authors: ARB contains no explicit threshold or tunable weighting function; the sampling weight is strictly proportional to the computed on-policyness score. Nevertheless, we agree that empirical robustness should be demonstrated. The revision will add a sensitivity analysis (appendix) showing performance under small perturbations of any scaling constants and across all D4RL domains, confirming that no environment-specific tuning is required.
Revision: partial
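The paired t-test over seeds that the simulated rebuttal promises can be sketched with the standard library alone. This is an illustrative helper, not the authors' evaluation code: `paired_t_statistic` is a hypothetical name, and it returns only the statistic and degrees of freedom, leaving the p-value lookup to the reader.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed scores.

    scores_a[k] and scores_b[k] must come from the same seed k.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With the five seeds mentioned in the rebuttal there are 4 degrees of freedom, for which the two-sided 5% critical value of Student's t is about 2.776. Where SciPy is available, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.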

Circularity Check

0 steps flagged

No circularity in ARB derivation; on-policyness metric defined independently


The paper defines the Adaptive Replay Buffer via a new on-policyness metric that directly measures trajectory alignment with the current policy and assigns proportional weights. This definition is presented as a lightweight, learning-free computation without equations that reduce the metric or claimed gains back to fitted parameters, self-referential loops, or prior self-citations. Experimental results on D4RL are empirical outcomes rather than derivations that equate outputs to inputs by construction. No load-bearing self-citation chains, uniqueness theorems, or ansatzes are invoked for the core mechanism.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of the newly introduced on-policyness metric for data prioritization. No explicit free parameters, background axioms, or external validation of the metric are described in the abstract.

invented entities (1)

on-policyness metric (no independent evidence)
purpose: Lightweight score measuring trajectory alignment with current policy to determine sampling weights
Introduced as the key novel component; no independent evidence or external validation supplied in the abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization
cs.LG · 2026-05 · unverdicted · novelty 6.0

ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in RL.
