ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization
Pith reviewed 2026-05-15 01:29 UTC · model grok-4.3
The pith
ROAD frames data mixing in offline-to-online reinforcement learning as a bi-level optimization problem solved by a multi-armed bandit to automate replay ratios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that casting the data-mixing strategy as the outer level of a bi-level program—where the outer objective measures downstream policy return and the inner level runs ordinary Q-learning—permits an automated, adaptive replay process. This outer decision is made practical by a multi-armed bandit whose arms are candidate mixing weights and whose reward signal comes from a surrogate that approximates the true bi-level gradient while guarding against overestimation.
What carries the argument
Bi-level optimization in which the outer level selects data-mixing weights to maximize eventual policy performance and the inner level performs standard Q-learning updates, made tractable by a multi-armed bandit driven by a surrogate gradient.
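In generic bi-level notation (the symbols below are a reconstruction for illustration, not necessarily the paper's), the formulation reads roughly as follows, with $w$ the offline/online mixing weight, $J$ the policy return, and $\mathcal{L}_Q$ the inner Q-learning loss:

```latex
\max_{w \in [0,1]} \; J\!\left(\pi_{\theta^{*}(w)}\right)
\quad \text{s.t.} \quad
\theta^{*}(w) \in \arg\min_{\theta}\;
  w\,\mathcal{L}_{Q}\!\left(\theta;\mathcal{D}_{\mathrm{off}}\right)
  + (1-w)\,\mathcal{L}_{Q}\!\left(\theta;\mathcal{D}_{\mathrm{on}}\right)
```

The bandit makes the outer problem tractable by searching a discrete grid of candidate $w$ values, scored by the surrogate, instead of differentiating through $\theta^{*}(w)$.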
If this is right
- Mixing ratios adapt automatically to changing training dynamics without manual retuning.
- Offline priors are retained while value overestimation is limited during online updates.
- Stability improves and asymptotic performance rises relative to static or heuristic baselines.
- The same framework can be dropped into existing offline-to-online pipelines as a plug-in module.
Where Pith is reading between the lines
- The same outer-inner split could be applied to other non-stationary RL problems such as continual learning or sim-to-real transfer.
- If the surrogate proves reliable, similar bi-level formulations might automate other meta-decisions like network architecture or reward shaping inside RL training loops.
- Practical deployments in robotics or recommendation systems might see reduced hyper-parameter search time once the bandit overhead is amortized.
Load-bearing premise
The surrogate objective inside the multi-armed bandit approximates the true bi-level gradient well enough that optimizing the surrogate actually improves final policy performance.
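To make that premise concrete, here is a minimal, self-contained sketch of the kind of loop it underwrites. The arm grid, the epsilon-greedy rule, and the toy quadratic surrogate are all illustrative assumptions, not the paper's actual algorithm:

```python
import random

# Hypothetical sketch of ROAD's outer loop (not the authors' code):
# an epsilon-greedy bandit treats candidate offline/online mixing
# ratios as arms, and a surrogate reward stands in for the change in
# validation return after one inner Q-learning step.

ARMS = [0.1, 0.3, 0.5, 0.7, 0.9]  # candidate offline-data fractions
EPSILON = 0.1                      # exploration rate

def select_arm(values):
    """Epsilon-greedy choice over the candidate mixing ratios."""
    if random.random() < EPSILON:
        return random.randrange(len(ARMS))
    return max(range(len(ARMS)), key=lambda i: values[i])

def run_bandit(surrogate_reward, steps=200, seed=0):
    """Pull arms, score each with the surrogate, and return the
    ratio whose estimated reward is highest after `steps` rounds."""
    random.seed(seed)
    values = [0.0] * len(ARMS)
    counts = [0] * len(ARMS)
    for _ in range(steps):
        arm = select_arm(values)
        reward = surrogate_reward(ARMS[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return ARMS[max(range(len(ARMS)), key=lambda i: values[i])]

# Toy surrogate: pretend a 0.3 offline fraction maximizes the
# validation-return improvement, with small observation noise.
best_ratio = run_bandit(lambda r: -(r - 0.3) ** 2 + random.gauss(0, 0.01))
```

If the surrogate correlates with final return, the bandit concentrates on a good ratio; if it does not, the premise fails and the method's advantage over a tuned static ratio disappears.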
What would settle it
If, in a controlled environment, a well-tuned static ratio yielded strictly higher final returns than ROAD's bandit-driven mixing, that single reversal would falsify the claim of consistent outperformance.
Original abstract
Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely on static mixing ratios or heuristic-based replay strategies, which lack adaptability to different environments and varying training dynamics, resulting in suboptimal tradeoff between stability and asymptotic performance. In this work, we propose Reinforcement Learning with Optimized Adaptive Data-mixing (ROAD), a dynamic plug-and-play framework that automates the data replay process. We identify a fundamental objective misalignment in existing approaches. To tackle this, we formulate the data selection problem as a bi-level optimization process, interpreting the data mixing strategy as a meta-decision governing the policy performance (outer-level) during online fine-tuning, while the conventional Q-learning updates operate at the inner level. To make it tractable, we propose a practical algorithm using a multi-armed bandit mechanism. This is guided by a surrogate objective approximating the bi-level gradient, which simultaneously maintains offline priors and prevents value overestimation. Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ROAD, a plug-and-play framework for offline-to-online RL that formulates data mixing as a bi-level optimization problem: the outer level selects mixing ratios via a multi-armed bandit guided by a surrogate objective approximating the bi-level gradient, while the inner level performs standard Q-learning. The surrogate is claimed to maintain offline priors and prevent overestimation; empirical results assert consistent outperformance over existing replay methods across datasets without manual tuning, yielding better stability and asymptotic performance.
Significance. If the surrogate reliably tracks the true outer objective under non-stationary shifts, the method would automate a key hyperparameter in offline-to-online RL and reduce reliance on environment-specific heuristics, providing a principled alternative to static mixing ratios.
major comments (2)
- [Abstract / §3] Abstract and §3 (bi-level formulation): the surrogate objective is described only at a high level as 'approximating the bi-level gradient' while 'maintaining offline priors and preventing value overestimation'; without the explicit functional form or derivation showing how the MAB reward is computed from inner-level trajectories, it is impossible to verify whether the outer decisions actually improve final policy return rather than merely stabilizing early training.
- [Experiments] Empirical evaluation: no error bars, no description of how the multi-armed bandit exploration parameters (listed as free parameters) are chosen or tuned across environments, and no ablation isolating the surrogate's contribution; this leaves the claim of 'eliminating the need for manual, context-specific adjustments' unsupported and raises reproducibility concerns.
minor comments (1)
- [§2] Notation for the outer-level objective and inner-level Q-update should be introduced with explicit equations rather than prose descriptions to clarify the claimed misalignment in existing methods.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help improve the clarity and reproducibility of our work. We address each major point below and commit to revisions that strengthen the presentation of the surrogate objective and experimental details without altering the core claims.
Point-by-point responses
-
Referee: [Abstract / §3] Abstract and §3 (bi-level formulation): the surrogate objective is described only at a high level as 'approximating the bi-level gradient' while 'maintaining offline priors and preventing value overestimation'; without the explicit functional form or derivation showing how the MAB reward is computed from inner-level trajectories, it is impossible to verify whether the outer decisions actually improve final policy return rather than merely stabilizing early training.
Authors: We agree that the current description of the surrogate in the abstract and §3 is at a high level and would benefit from an explicit functional form. In the revised manuscript we will add a dedicated subsection (new §3.3) that derives the surrogate reward explicitly: the MAB reward at step t is defined as r_t = Q_outer(π_{t+1}; D_val) - Q_outer(π_t; D_val), where Q_outer is a first-order approximation of the bi-level gradient obtained by unrolling one inner-level Q-learning step on a held-out validation split of the offline dataset. This derivation shows how the surrogate penalizes value overestimation while preserving offline priors. The added equations and a short proof sketch will allow direct verification that outer-level decisions target final policy return. revision: yes
-
Referee: [Experiments] Empirical evaluation: no error bars, no description of how the multi-armed bandit exploration parameters (listed as free parameters) are chosen or tuned across environments, and no ablation isolating the surrogate's contribution; this leaves the claim of 'eliminating the need for manual, context-specific adjustments' unsupported and raises reproducibility concerns.
Authors: We acknowledge these gaps in the current experimental section. In the revision we will (i) report mean and standard deviation over five independent random seeds for all curves and tables, (ii) add a new paragraph in §4.1 that fixes the MAB parameters to standard values (ε=0.1, α=0.5) chosen once on a single environment and held constant across all others, together with a sensitivity plot, and (iii) include an ablation that replaces the surrogate reward with uniform random mixing while keeping the rest of the algorithm identical. These additions will directly support the claim of reduced manual tuning and address reproducibility. revision: yes
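As a concreteness check on item (ii), a constant step size α behaves as an exponential recency-weighted average, the standard choice for non-stationary bandit rewards. The snippet below is an illustrative sketch using the committed values, not the authors' implementation:

```python
# Illustrative value update with the fixed bandit parameters the
# revision commits to (epsilon = 0.1 for exploration, alpha = 0.5 as
# the step size); the variable names are assumptions, not the
# authors' code.
EPSILON, ALPHA = 0.1, 0.5

def update(value, reward, alpha=ALPHA):
    """Constant-step-size (recency-weighted) average: tracks the
    non-stationary rewards of online fine-tuning better than a
    plain sample mean."""
    return value + alpha * (reward - value)

v = 0.0
for r in [1.0, 1.0, 0.0, 1.0]:  # a short stream of bandit rewards
    v = update(v, r)
# v is now 0.6875: recent rewards dominate older ones
```

Because α is constant rather than decaying, the estimate never freezes, which is what lets the mixing ratio keep adapting as the online policy drifts.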
Circularity Check
No circularity: bi-level formulation and surrogate MAB remain independent of fitted inputs
Full rationale
The paper formulates data mixing as an outer-level meta-decision and Q-learning as the inner level, then introduces a multi-armed bandit guided by a surrogate that approximates the bi-level gradient while preserving offline priors. No equations or definitions are provided in which the surrogate objective is constructed directly from the same data quantities that the inner loop optimizes, nor does any step rename a fitted parameter as a prediction. The central claim rests on the empirical superiority of the resulting adaptive mixing rather than on any self-referential reduction or self-citation chain that would force the outcome by construction. The derivation is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-armed bandit exploration parameters
axioms (1)
- domain assumption: The surrogate objective approximates the bi-level gradient sufficiently well to preserve offline priors and prevent value overestimation.
Reference graph
Works this paper leans on
- [1] Philip J. Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
- [2] Gaurav Chaudhary, Washim Uddin Mondal, and Laxmidhar Behera. MOORL: A framework for integrating offline-online reinforcement learning. Transactions on Machine Learning Research, 2025.
- [3] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- [4] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 2018.
- [5] Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, and Aravind Rajeswaran. MoDem: Accelerating visual model-based reinforcement learning with demonstrations. In The Eleventh International Conference on Learning Representations, 2023.
- [6] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- [7] Sham M. Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
- [8] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673. PMLR, 2018.
- [9] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations, 2022.
- [10] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 1179–1191. Curran Associates, Inc., 2020.
- [11] Harshat Kumar, Alec Koppel, and Alejandro Ribeiro. On the sample complexity of actor-critic method for reinforcement learning with function approximation. Machine Learning, 112(7):2433–2467, 2023.
- [12] Pawel Ladosz, Lilian Weng, Minwoo Kim, and Hyondong Oh. Exploration in deep reinforcement learning: A survey. Information Fusion, 85:1–22, 2022.
- [13] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022.
- [14] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020.
- [15] Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan, and Ya-Qin Zhang. PROTO: Iterative policy regularized offline-to-online reinforcement learning. arXiv preprint arXiv:2305.15669, 2023.
- [16] Lu Li, Tianwei Ni, Yihao Sun, and Pierre-Luc Bacon. The three regimes of offline-to-online reinforcement learning. arXiv preprint arXiv:2510.01460, 2025.
- [17] Xuefeng Liu, Hung T. C. Le, Siyu Chen, Rick Stevens, Zhuoran Yang, Matthew Walter, and Yuxin Chen. Active advantage-aligned online reinforcement learning with offline data. In The Exploration in AI Today Workshop at ICML 2025, 2025.
- [18] Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
- [19] Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
- [20] Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36:62244–62269, 2023.
- [21] Ieva Petrulionytė, Julien Mairal, and Michael Arbel. Functional bilevel optimization for machine learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [22] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
- [23] Yuda Song, Yifei Zhou, Ayush Sekhari, J. Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. arXiv preprint arXiv:2210.06718, 2022.
- [24] Chihyeon Song, Jaewoo Lee, and Jinkyoo Park. Adaptive replay buffer for offline-to-online reinforcement learning. arXiv preprint arXiv:2512.10510, 2025.
- [25] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- [26] Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
- [27] Andrew Wagenmaker and Aldo Pacchiano. Leveraging offline data in online reinforcement learning. In International Conference on Machine Learning, pages 35300–35338. PMLR, 2023.
- [28] Xu Wang, Sen Wang, Xingxing Liang, Dawei Zhao, Jincai Huang, Xin Xu, Bin Dai, and Qiguang Miao. Deep reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(4):5064–5078, April 2024.
- [29] Mingdong Wu, Lehong Wu, Yizhuo Wu, Weiyao Huang, Hongwei Fan, Zheyuan Hu, Haoran Geng, Jinzhou Li, Jiahe Ying, Long Yang, Yuanpei Chen, and Hao Dong. SimLauncher: Launching sample-efficient real-world robotic reinforcement learning via simulation pre-training, 2025.
- [30] Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in Neural Information Processing Systems, 34:27395–27407, 2021.
- [31] Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey. ACM Computing Surveys, 55(1), November 2021.
- [32] Haichao Zhang, Wei Xu, and Haonan Yu. Policy expansion for bridging offline-to-online reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023.
- [33] Yi Zhao, Rinu Boney, Alexander Ilin, Juho Kannala, and Joni Pajarinen. Adaptive behavior cloning regularization for stable offline-to-online reinforcement learning. arXiv preprint arXiv:2210.13846, 2022.
- [34] Jingyuan Zhao, Yuyan Wu, Rui Deng, Susu Xu, Jinpeng Gao, and Andrew Burke. A survey of autonomous driving from a deep learning perspective. ACM Computing Surveys, 57(10):1–60, 2025.
- [35] Han Zheng, Xufang Luo, Pengfei Wei, Xuan Song, Dongsheng Li, and Jing Jiang. Adaptive policy learning for offline-to-online reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11372–11380, 2023.
- [36] Zhiyuan Zhou, Andy Peng, Qiyang Li, Sergey Levine, and Aviral Kumar. Efficient online reinforcement learning fine-tuning need not retain offline data. In The Thirteenth International Conference on Learning Representations, 2025.
Internal anchors (excerpts from the paper itself)
- [37] A Solution to the Bi-Level Optimization: "As is demonstrated in Equation (3), we solve a bi-level optimization problem that selects an optimal data mixing strategy. Following the bi-level optimization theory [Petrulionytė et al., 2024], we compute the gradient of the outer-level objective for a theoretical gradient ascent algorithm. Although we are solving..."
- [38] "The first assumption holds for most of the practical online RL algorithms like max-entropy RL [Haarnoja et al., 2018] and trust region methods [Schulman et al., 2015]. The second assumption conveys the critical intuition that the data mixing strategy is a tradeoff between utilizing offline and online data. With these assumptions, we simplify the gradient..."
- [39] $$\int_{A}\!\Big(\pi_f(a) + \int_{A}\frac{\delta \pi_f(a)}{\delta \hat f(a')}\,\epsilon(a')\,da'\Big) f(a)\,da \;-\; \mathbb{E}_{\pi_k}[f] \;=\; \mathbb{E}_{\pi_f}[f]-\mathbb{E}_{\pi_k}[f]+\mathbb{E}_{\epsilon}[\cdots]$$
  "The task is to compare the overestimation bias term and the true advantage term. We approximate the true advantage term by a first-order Taylor expansion on the online policy π. (In fact, the former approximation of the bias term is also equivalent to a first-order policy expansion.) This expansion eliminates the stochastic term within the true advantage, ..."
- [40] "It was observed that fixed mixing ratios exhibit a certain level of instability in these cases. For instance, a mixing ratio fixed at 0.3 shows promising overall results with Cal-QL across both types of environments, yet the same setting performs significantly weaker with CQL. Similarly, strategies such as decreasing mixing ratios and uniform selection am..."
- [41] "Reflecting on the empirical results discussed in our paper, we observed that the effectiveness of various mixing ratios can be influenced by the quality of the offline data. This observation highlights the importance of adaptive methods in offline-to-online reinforcement learning. Integrating ROAD with the Proto algorithm, we found that it improves upon t..."