COOPO: Cyclic Offline-Online Policy Optimization Algorithm

Aditya Balu; Cody Fleming; Joshua Russell Waite; Qisai Liu; Soumik Sarkar; Zhanhong Jiang

arXiv: 2605.18675 · v1 · pith:PEULEP3J · submitted 2026-05-18 · cs.LG · cs.AI

Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu, Cody Fleming, Soumik Sarkar

Pith reviewed 2026-05-20 13:06 UTC · model grok-4.3

keywords: reinforcement learning · offline RL · online RL · hybrid methods · sample efficiency · distributional shift · policy optimization

The pith

COOPO cycles between offline and online phases to achieve better sample efficiency than pure online RL with monotonic improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents COOPO as a framework for hybrid reinforcement learning that repeatedly cycles between offline and online training. Each cycle begins with constrained offline updates using KL regularization to anchor the policy to the dataset, followed by online fine-tuning. This design is intended to prevent distributional shift and catastrophic forgetting while maximizing the use of offline data and minimizing online interactions. A sympathetic reader would care because it promises more practical RL by lowering the number of costly environment interactions required for high performance.

Core claim

COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions by cycling between constrained offline training and online fine-tuning, where the offline phase uses KL-regularized advantage-weighted updates to eliminate drift and forgetting.
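
For orientation, a KL-regularized advantage-weighted update is commonly written as follows in the advantage-weighted regression literature. This is a sketch of that general family, not necessarily the paper's exact objective. The symbols $\mathcal{D}$ (offline dataset), $\pi_\beta$ (the behavior policy that generated it), $A$ (an advantage estimate), $\epsilon$ (the KL budget), and $\lambda$ (the temperature) are notational assumptions made here.

```latex
% Constrained form: raise expected advantage while staying close to the data.
\max_{\pi}\; \mathbb{E}_{s\sim\mathcal{D}}\,\mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[A(s,a)\big]
\quad\text{s.t.}\quad
\mathbb{E}_{s\sim\mathcal{D}}\Big[D_{\mathrm{KL}}\big(\pi(\cdot\mid s)\,\|\,\pi_\beta(\cdot\mid s)\big)\Big]\le\epsilon

% The Lagrangian relaxation has the closed-form solution
\pi^{\star}(a\mid s)\;\propto\;\pi_\beta(a\mid s)\,\exp\!\big(A(s,a)/\lambda\big)

% which is fit by weighted maximum likelihood on dataset actions:
\max_{\theta}\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\exp\!\big(A(s,a)/\lambda\big)\,\log\pi_\theta(a\mid s)\Big]
```

In this form a large $\lambda$ flattens the weights toward plain behavior cloning, while a small $\lambda$ concentrates the update on high-advantage dataset actions.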

What carries the argument

The cyclic structure alternating between KL-regularized advantage-weighted offline policy updates to minimize distributional shift and online policy optimization for stable exploration.
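
The alternation described above can be sketched on a toy problem. The block below is a minimal illustration, assuming a single-state bandit, a softmax policy, advantage-weighted regression for the offline phase, and REINFORCE standing in for "any policy optimization" in the online phase. Every name, hyperparameter, and number (`ARM_MEANS`, `DATASET`, `offline_phase`, `online_phase`, `cyclic_training`) is hypothetical and is not taken from the authors' implementation.

```python
import math
import random

ARM_MEANS = [0.1, 0.5, 0.9]  # hypothetical true mean reward of each arm
# Hypothetical offline dataset of (action, reward) pairs, skewed toward weak arms.
DATASET = [(0, 0.1)] * 5 + [(1, 0.5)] * 4 + [(2, 0.9)] * 1


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def offline_phase(logits, dataset, temperature=0.3, lr=0.5, steps=20):
    """Advantage-weighted regression on the fixed dataset (no environment calls)."""
    baseline = sum(r for _, r in dataset) / len(dataset)
    for _ in range(steps):
        probs = softmax(logits)
        grad = [0.0] * len(logits)
        for action, reward in dataset:
            weight = math.exp((reward - baseline) / temperature)
            # Gradient of weight * log pi(action) with respect to the logits.
            for k in range(len(logits)):
                indicator = 1.0 if k == action else 0.0
                grad[k] += weight * (indicator - probs[k])
        logits = [x + lr * g / len(dataset) for x, g in zip(logits, grad)]
    return logits


def online_phase(logits, arm_means, rng, lr=0.1, steps=50):
    """REINFORCE with a running-mean baseline; one environment call per step."""
    baseline = 0.0
    for t in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_means[action] + rng.gauss(0.0, 0.1)  # noisy environment
        advantage = reward - baseline
        baseline += (reward - baseline) / (t + 1)
        logits = [
            x + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
            for k, x in enumerate(logits)
        ]
    return logits


def cyclic_training(arm_means, dataset, cycles=5, online_steps=50, seed=0):
    """Alternate offline anchoring and online fine-tuning, counting interactions."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    interactions = 0
    history = []  # expected return of the policy at the end of each cycle
    for _ in range(cycles):
        logits = offline_phase(logits, dataset)  # anchor to the dataset
        logits = online_phase(logits, arm_means, rng, steps=online_steps)
        interactions += online_steps
        probs = softmax(logits)
        history.append(sum(p * m for p, m in zip(probs, arm_means)))
    return softmax(logits), interactions, history
```

Calling `cyclic_training(ARM_MEANS, DATASET)` returns the final action probabilities, the total number of online interactions, and the per-cycle expected return. The toy has no state, so it cannot exhibit distributional shift over states; it only shows the control flow of the loop.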

If this is right

Surpasses pure online RL in online sample efficiency
Guarantees monotonic improvement under coverage assumptions
Reduces online environment interactions compared to state-of-the-art hybrids
Improves final returns on D4RL benchmarks
Remains robust across diverse offline algorithms and online optimizers

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This cyclic approach may help in developing more stable continual learning systems in RL.
Optimal cycle frequency could be explored to further optimize the balance between offline and online phases.
Applications in real-world robotics might benefit from reduced interaction needs with physical environments.

Load-bearing premise

That periodically returning to offline training on the dataset reliably eliminates distributional drift and catastrophic forgetting.

What would settle it

A test showing that COOPO does not reduce online interactions or fails to maintain monotonic improvement even when coverage assumptions are met.
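
One way such a test could be operationalized is sketched below, assuming per-cycle evaluation returns and cumulative online interaction counts have been logged. The helper names and the tolerance parameter are hypothetical; with noisy returns, a tolerance or a statistical comparison across seeds would be needed rather than a strict check.

```python
from typing import Optional, Sequence


def is_monotone(returns: Sequence[float], tol: float = 0.0) -> bool:
    """True if no evaluation return drops by more than tol from the previous one."""
    return all(later >= earlier - tol for earlier, later in zip(returns, returns[1:]))


def interactions_to_reach(
    returns: Sequence[float], interactions: Sequence[int], target: float
) -> Optional[int]:
    """Cumulative online interactions at the first evaluation that meets the target."""
    for value, count in zip(returns, interactions):
        if value >= target:
            return count
    return None  # the target was never reached
```

Comparing `interactions_to_reach` for COOPO against a pure online baseline at the same target return would address the sample-efficiency half of the claim, and `is_monotone` on the per-cycle returns would address the other half.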

Figures

Figures reproduced from arXiv: 2605.18675 by Aditya Balu, Cody Fleming, Joshua Russell Waite, Qisai Liu, Soumik Sarkar, Zhanhong Jiang.

**Figure 2.** Schematic diagram of COOPO: COOPO cyclically …

**Figure 3.** MuJoCo environments used in this work for evaluation: …

**Figure 4.** Average return vs. gradient steps on HalfCheetah for …

**Figure 6.** Training performance across different online-offline …

**Figure 7.** Training performance against trajectory number between …

**Figure 8.** Performance comparison in the HalfCheetah environment for different λ values, averaged over multiple seeds

Original abstract

Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

COOPO's cyclic hybrid framework is a reasonable practical idea for cutting online interactions in RL, but the monotonic improvement guarantee looks shaky because coverage after the online phase is not shown to hold.

The main point is that COOPO cycles between a KL-regularized offline anchor and online fine-tuning, then loops back to the offline step to limit drift and forgetting. This is presented as a generalized framework that reduces online samples while keeping performance up on D4RL tasks compared with other hybrids. The periodic return to the dataset is the clearest new piece; it directly targets the forgetting problem that one-shot offline-to-online methods often hit.

The benchmarks test the method with several offline algorithms and online optimizers, which is a useful check for robustness, and the reported drop in online interactions is the kind of practical result people in applied RL care about. Credit for shipping those comparisons and for keeping the framework modular enough that existing components can slot in.

The soft spot is the theory. The abstract claims guaranteed monotonic improvement under standard coverage assumptions for each cycle. The stress-test concern lands: after the online fine-tuning step, nothing in the description shows that the coverage coefficient stays bounded for the next offline update. If the online phase expands support away from the fixed dataset, the KL-regularized advantage-weighted step could lose its improvement bound or even go negative. The paper invokes the assumption per cycle but does not appear to derive preservation across the loop, so the central guarantee is not yet solid.

Experiments are standard for the area but would benefit from clearer reporting on how interaction counts were measured and whether gains hold under different random seeds.

This paper is aimed at people working on hybrid RL for settings where online data is expensive, such as robotics or control with real hardware. A reader who wants a simple cyclic recipe to try on top of existing offline and online code would get value from the framework and the benchmark numbers. It deserves a serious referee to check the coverage argument and the experimental details.

I would send it for peer review with a request to tighten the theoretical section on cross-cycle coverage.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces COOPO, a cyclic offline-online RL framework that alternates between KL-regularized advantage-weighted offline updates (to anchor the policy and minimize distributional shift) and online fine-tuning with arbitrary policy optimizers. It claims superior online sample efficiency over pure online RL, guaranteed monotonic improvement across cycles under standard coverage assumptions, and empirical gains on D4RL benchmarks including reduced online interactions and higher returns while remaining robust to choice of offline and online components.

Significance. If the monotonic improvement guarantee holds and the cyclic structure demonstrably preserves coverage while reducing online samples, the work would advance hybrid RL by providing a general, reusable framework that mitigates drift and forgetting; the reported robustness across algorithm choices would be a practical strength.

major comments (1)

[Section 3] Section 3 (Theoretical Analysis), monotonic improvement claim: the guarantee of monotonic improvement across the full cyclic trajectory rests on the coverage assumption (e.g., concentrability or single-policy coverage w.r.t. the fixed offline dataset) remaining valid after each online fine-tuning phase so that the subsequent KL-regularized offline update can again deliver a non-negative improvement bound. The manuscript invokes the assumption for each offline step but does not derive or bound how the online optimizer affects the coverage coefficient; if the online phase increases support mismatch, the next offline improvement bound can degrade or become negative, which is load-bearing for the central theoretical claim of monotonicity over multiple cycles.
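
To make the concern concrete, the sketch below computes a density-ratio coverage coefficient, the largest ratio between a policy's state-action occupancy and the offline data distribution, for small hand-made tabular distributions. The function name and the distributions are hypothetical, and the coefficient used in the paper's proofs may be defined differently.

```python
import math
from typing import Dict, Hashable


def coverage_coefficient(
    policy_occupancy: Dict[Hashable, float], data_dist: Dict[Hashable, float]
) -> float:
    """Return max over keys of policy_occupancy / data_dist (inf if off-support)."""
    worst = 0.0
    for key, p in policy_occupancy.items():
        if p == 0.0:
            continue  # pairs the policy never visits do not matter
        q = data_dist.get(key, 0.0)
        if q == 0.0:
            return math.inf  # the policy visits a pair the dataset never covers
        worst = max(worst, p / q)
    return worst


# Keys stand in for (state, action) pairs; all numbers are made up.
DATA = {"a": 0.5, "b": 0.4, "c": 0.1}
ANCHORED = {"a": 0.4, "b": 0.4, "c": 0.2}   # close to the dataset
DRIFTED = {"a": 0.1, "b": 0.1, "c": 0.8}    # mass moved to a rarely seen pair
OFF_SUPPORT = {"c": 0.5, "d": 0.5}          # visits a pair absent from the data
```

Here the coefficient is 2 for `ANCHORED`, 8 for `DRIFTED`, and infinite for `OFF_SUPPORT`. Any improvement bound that scales with this quantity weakens accordingly if online fine-tuning moves the policy in that direction.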

minor comments (1)

[Abstract] Abstract: the phrase 'standard coverage assumptions' is used without specifying the precise form (e.g., concentrability coefficient bound or single-policy concentrability) employed in the proofs; adding this would improve clarity.
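
For reference, the two forms named in this comment are usually stated as below in the offline RL theory literature. The notation is an assumption made here ($d^{\pi}$ for the discounted state-action occupancy of $\pi$, $\mu$ for the offline data distribution, $\pi^{\star}$ for a comparator policy); which form the paper's proofs rely on is not stated in the material above.

```latex
% All-policy concentrability: every policy must be covered by the data.
\sup_{\pi}\;\Big\|\frac{d^{\pi}}{\mu}\Big\|_{\infty}\;\le\;C

% Single-policy concentrability: only the comparator policy must be covered.
\Big\|\frac{d^{\pi^{\star}}}{\mu}\Big\|_{\infty}\;\le\;C^{\star}
```

The single-policy form is the weaker requirement, since it constrains one occupancy measure rather than all of them.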

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. The major comment on the theoretical analysis in Section 3 raises an important point about sustaining the coverage assumption across cycles, which we address directly below. We believe incorporating the requested clarification will strengthen the presentation of the monotonic improvement result.

Referee: [Section 3] Section 3 (Theoretical Analysis), monotonic improvement claim: the guarantee of monotonic improvement across the full cyclic trajectory rests on the coverage assumption (e.g., concentrability or single-policy coverage w.r.t. the fixed offline dataset) remaining valid after each online fine-tuning phase so that the subsequent KL-regularized offline update can again deliver a non-negative improvement bound. The manuscript invokes the assumption for each offline step but does not derive or bound how the online optimizer affects the coverage coefficient; if the online phase increases support mismatch, the next offline improvement bound can degrade or become negative, which is load-bearing for the central theoretical claim of monotonicity over multiple cycles.

Authors: We appreciate the referee's careful reading and agree that an explicit treatment of how the coverage coefficient evolves after the online phase is necessary to rigorously close the monotonicity argument over multiple cycles. The current manuscript states the standard coverage assumption (concentrability or single-policy coverage with respect to the fixed offline dataset) at the beginning of each offline update but does not derive a bound on its possible growth during online fine-tuning. In the revised manuscript we will add a short supporting lemma in Section 3 that bounds the change in the coverage coefficient after a finite number of online steps. Under the mild additional assumption that the online optimizer employs a trust-region constraint (KL divergence between consecutive policies bounded by a small constant), the concentrability coefficient can increase by at most a multiplicative factor that depends on the trust-region radius and the number of online steps. With this bound in hand, the improvement delivered by the subsequent KL-regularized offline step remains non-negative, thereby preserving the overall monotonicity guarantee across cycles. We will also add a brief remark clarifying that the cyclic structure itself helps control drift because each offline anchoring step resets the policy toward the dataset support.

revision: yes
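
As a rough indication of what a trust-region assumption buys, the standard argument below bounds how far the policy can drift over $K$ online steps. It is a sketch under assumed notation ($\pi_k$ for the $k$-th online iterate, $\delta$ for the per-step KL radius), and it is not the lemma the simulated authors promise.

```latex
% Assumed trust region on consecutive online iterates, for every state s:
D_{\mathrm{KL}}\big(\pi_{k+1}(\cdot\mid s)\,\|\,\pi_{k}(\cdot\mid s)\big)\le\delta,
\qquad k=0,\dots,K-1

% Pinsker's inequality bounds each step in total variation:
D_{\mathrm{TV}}\big(\pi_{k+1}(\cdot\mid s),\,\pi_{k}(\cdot\mid s)\big)\le\sqrt{\delta/2}

% The triangle inequality over K steps then gives:
D_{\mathrm{TV}}\big(\pi_{K}(\cdot\mid s),\,\pi_{0}(\cdot\mid s)\big)\le K\sqrt{\delta/2}
```

A total-variation bound on the policy does not by itself bound a sup-norm density ratio against the dataset, so a multiplicative bound on the concentrability coefficient would need further assumptions, for example a lower bound on the data distribution over the visited state-action pairs.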

Circularity Check

0 steps flagged

No circularity: theoretical claims invoke external standard assumptions without reduction to inputs

The paper's central theoretical claim of guaranteed monotonic improvement and better sample efficiency is explicitly conditioned on 'standard coverage assumptions' from the broader RL literature. These assumptions are treated as given external inputs rather than derived from or redefined in terms of the COOPO cyclic structure. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description that would reduce the claimed guarantees to the method's own outputs by construction. The cyclic offline-online framework is presented as an independent algorithmic contribution that leverages (but does not presuppose) those assumptions, with empirical validation on D4RL benchmarks providing separate support. This qualifies as a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest primarily on standard RL coverage assumptions for the theoretical guarantees and on the effectiveness of the described cyclic structure; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: standard coverage assumptions. Invoked to support the guaranteed monotonic improvement and better sample efficiency claims.

[1] [1]

Safe learning in robotics: From learning-based control to safe reinforcement learning,

L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, no. 1, pp. 411–444, 2022

work page 2022

[2] [2]

Reinforcement learning in robotic applications: a comprehensive survey,

B. Singh, R. Kumar, and V . P. Singh, “Reinforcement learning in robotic applications: a comprehensive survey,”Artificial Intelligence Review, vol. 55, no. 2, pp. 945–990, 2022

work page 2022

[3] [3]

Deep reinforcement learning in smart manufacturing: A review and prospects,

C. Li, P. Zheng, Y . Yin, B. Wang, and L. Wang, “Deep reinforcement learning in smart manufacturing: A review and prospects,”CIRP Journal of Manufacturing Science and Technology, vol. 40, pp. 75–101, 2023

work page 2023

[4] [4]

Deep reinforcement learning in medical imaging: A literature review,

S. K. Zhou, H. N. Le, K. Luu, H. V . Nguyen, and N. Ayache, “Deep reinforcement learning in medical imaging: A literature review,”Medical image analysis, vol. 73, p. 102193, 2021

work page 2021

[5] [5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[7] [7]

Soft Actor-Critic Algorithms and Applications

T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V . Kumar, H. Zhu, A. Gupta, P. Abbeelet al., “Soft actor-critic algorithms and applications,”arXiv preprint arXiv:1812.05905, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] R. F. Prudencio, M. R. Maximo, and E. L. Colombini, “A survey on offline reinforcement learning: Taxonomy, review, and open problems,” IEEE Transactions on Neural Networks and Learning Systems, 2023.

[9] S. Guo, L. Zou, H. Chen, B. Qu, H. Chi, P. S. Yu, and Y. Chang, “Sample efficient offline-to-online reinforcement learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 3, pp. 1299–1310, 2023.

[10] Y. Song, Y. Zhou, A. Sekhari, J. A. Bagnell, A. Krishnamurthy, and W. Sun, “Hybrid RL: Using both offline and online data can make RL efficient,” arXiv preprint arXiv:2210.06718, 2022.

[11] Y. Luo, J. Kay, E. Grefenstette, and M. P. Deisenroth, “Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions,” arXiv preprint arXiv:2303.17396, 2023.

[12] P. J. Ball, L. Smith, I. Kostrikov, and S. Levine, “Efficient online reinforcement learning with offline data,” in International Conference on Machine Learning. PMLR, 2023, pp. 1577–1594.

[13] Z. Zhou, A. Peng, Q. Li, S. Levine, and A. Kumar, “Efficient online reinforcement learning fine-tuning need not retain offline data,” arXiv preprint arXiv:2412.07762, 2024.

[14] Q. Wang and C. Tang, “Deep reinforcement learning for transportation network combinatorial optimization: A survey,” Knowledge-Based Systems, vol. 233, p. 107526, 2021.

[15] E. O. Neftci and B. B. Averbeck, “Reinforcement learning in artificial and biological systems,” Nature Machine Intelligence, vol. 1, no. 3, pp. 133–143, 2019.

[16] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative Q-learning for offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191, 2020.

[17] I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit Q-learning,” arXiv preprint arXiv:2110.06169, 2021.

[18] Y. Wu, G. Tucker, and O. Nachum, “Behavior regularized offline reinforcement learning,” arXiv preprint arXiv:1911.11361, 2019.

[19] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in International Conference on Machine Learning. PMLR, 2019, pp. 2052–2062.

[20] D. Ghosh, A. Ajay, P. Agrawal, and S. Levine, “Offline RL policies should be trained to be adaptive,” in International Conference on Machine Learning. PMLR, 2022, pp. 7513–7530.

[21] Z. Zhuang, K. Lei, J. Liu, D. Wang, and Y. Guo, “Behavior proximal policy optimization,” arXiv preprint arXiv:2302.11312, 2023.

[22] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims, “MOReL: Model-based offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 810–21 823, 2020.

[23] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma, “MOPO: Model-based offline policy optimization,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 129–14 142, 2020.

[24] A. Nair, A. Gupta, M. Dalal, and S. Levine, “AWAC: Accelerating online reinforcement learning with offline datasets,” arXiv preprint arXiv:2006.09359, 2020.

[25] M. Nakamoto, S. Zhai, A. Singh, M. Sobol Mark, Y. Ma, C. Finn, A. Kumar, and S. Levine, “Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning,” Advances in Neural Information Processing Systems, vol. 36, pp. 62 244–62 269, 2023.

[26] K. Lei, Z. He, C. Lu, K. Hu, Y. Gao, and H. Xu, “Uni-O4: Unifying online and offline deep reinforcement learning with multi-step on-policy optimization,” arXiv preprint arXiv:2311.03351, 2023.

[27] H. Zheng, X. Luo, P. Wei, X. Song, D. Li, and J. Jiang, “Adaptive policy learning for offline-to-online reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, 2023, pp. 11 372–11 380.

[28] W. Xiao, J. Liu, Z. Zhuang, R. Suo, S. Lyu, and D. Wang, “Efficient online RL fine tuning with offline pre-trained policy only,” arXiv preprint arXiv:2505.16856, 2025.

[29] K. Zhao, Y. Ma, J. Liu, J. Hao, Y. Zheng, and Z. Meng, “Improving offline-to-online reinforcement learning with Q-ensembles,” in ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023.

[30] Q.-W. Luo, M.-K. Xie, Y. Wang, and S.-J. Huang, “Optimistic critic reconstruction and constrained fine-tuning for general offline-to-online RL,” Advances in Neural Information Processing Systems, vol. 37, pp. 108 167–108 207, 2024.

[31] R. Huang, D. Li, C. Shi, C. Shen, and J. Yang, “Augmenting online RL with offline data is all you need: A unified hybrid RL algorithm design and analysis,” arXiv preprint arXiv:2505.13768, 2025.

[32] Q. Li, Z. Zhou, and S. Levine, “Reinforcement learning with action chunking,” arXiv preprint arXiv:2507.07969, 2025.

[33] N.-C. Huang, P.-C. Hsieh, K.-H. Ho, and I.-C. Wu, “PPO-clip attains global optimality: Towards deeper understandings of clipping,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 12 600–12 607.

[34] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning. PMLR, 2015, pp. 1889–1897.

[35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.

[36] X. B. Peng, A. Kumar, G. Zhang, and S. Levine, “Advantage-weighted regression: Simple and scalable off-policy reinforcement learning,” arXiv preprint arXiv:1910.00177, 2019.

[37] H. K. Khalil, Nonlinear Systems. Prentice Hall, 2002.

[38] K. J. Åström and B. Wittenmark, Adaptive Control, 2nd ed. Addison-Wesley, 1995.

[39] T. M. Cover, Elements of Information Theory. John Wiley & Sons, 1999.

[40] T. Xie, N. Jiang, H. Wang, C. Xiong, and Y. Bai, “Policy finetuning: Bridging sample-efficient offline and online reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 27 395–27 407, 2021.

[41] P. Rashidinejad, B. Zhu, C. Ma, J. Jiao, and S. Russell, “Bridging offline reinforcement learning and imitation learning: A tale of pessimism,” Advances in Neural Information Processing Systems, vol. 34, pp. 11 702–11 716, 2021.

[42] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 267–274.

[43] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in International Conference on Machine Learning. PMLR, 2017, pp. 22–31.

[44] A. A. Fedotov, P. Harremoës, and F. Topsøe, “Refinements of Pinsker’s inequality,” IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1491–1498, 2003.

[45] G. Garrigos and R. M. Gower, “Handbook of convergence theorems for (stochastic) gradient methods,” arXiv preprint arXiv:2301.11235, 2023.

[46] S. Qiu, Z. Yang, J. Ye, and Z. Wang, “On finite-time convergence of actor-critic algorithm,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 2, pp. 652–664, 2021.

[47] R. Yuan, R. M. Gower, and A. Lazaric, “A general sample complexity analysis of vanilla policy gradient,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 3332–3380.

[48] T. Xu, Z. Wang, and Y. Liang, “Improving sample complexity bounds for (natural) actor-critic algorithms,” Advances in Neural Information Processing Systems, vol. 33, pp. 4358–4369, 2020.

[49] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.

[50] F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” arXiv preprint arXiv:1805.01954, 2018.

[51] B. Eysenbach, M. Geist, S. Levine, and R. Salakhutdinov, “A connection between one-step RL and critic regularization in reinforcement learning,” in International Conference on Machine Learning. PMLR, 2023, pp. 9485–9507.

[52] S. Fujimoto and S. S. Gu, “A minimalist approach to offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 20 132–20 145, 2021.

[53] Q. Zheng, A. Zhang, and A. Grover, “Online decision transformer,” in International Conference on Machine Learning. PMLR, 2022, pp. 27 042–27 059.

[54] H. Zhang, W. Xu, and H. Yu, “Policy expansion for bridging offline-to-online reinforcement learning,” arXiv preprint arXiv:2302.00935, 2023.

[55] S. Lee, Y. Seo, K. Lee, P. Abbeel, and J. Shin, “Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble,” in Conference on Robot Learning. PMLR, 2022, pp. 1702–1712.