COOPO: Cyclic Offline-Online Policy Optimization Algorithm
Pith reviewed 2026-05-20 13:06 UTC · model grok-4.3
The pith
COOPO cycles between offline and online phases to achieve better online sample efficiency than pure online RL, with monotonic improvement guaranteed under standard coverage assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By cycling between constrained offline training and online fine-tuning, COOPO achieves better online sample efficiency than pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. The offline phase uses KL-regularized advantage-weighted updates to minimize distributional shift, and periodically returning to it is claimed to eliminate drift and forgetting.
What carries the argument
The cyclic structure alternating between KL-regularized advantage-weighted offline policy updates to minimize distributional shift and online policy optimization for stable exploration.
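To make the offline mechanism concrete, here is a minimal tabular sketch of a KL-regularized advantage-weighted update. It uses the standard closed form pi(a|s) proportional to mu(a|s) * exp(A(s,a) / temperature), which maximizes E_pi[A] - temperature * KL(pi || mu). The function names, the temperature value, and the toy numbers are assumptions for illustration; the paper's exact objective and its choice of anchor policy are not given in this summary.

```python
import numpy as np

def advantage_weighted_update(behavior, advantages, temperature):
    """Closed-form maximizer of E_pi[A] - temperature * KL(pi || behavior), per state."""
    behavior = np.asarray(behavior, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Subtract the per-state max before exponentiating for numerical stability.
    shifted = advantages - advantages.max(axis=-1, keepdims=True)
    unnormalized = behavior * np.exp(shifted / temperature)
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)

def kl_divergence(p, q):
    """KL(p || q) per state, assuming q is strictly positive."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    terms = np.where(mask, p * np.log(np.where(mask, p, 1.0) / q), 0.0)
    return terms.sum(axis=-1)

# Hypothetical single-state example with three actions.
behavior = np.array([[0.5, 0.3, 0.2]])
advantages = np.array([[-1.0, 0.0, 2.0]])

# A large temperature keeps the policy close to the behavior policy;
# a small one lets the advantages dominate.
tight = advantage_weighted_update(behavior, advantages, temperature=10.0)
loose = advantage_weighted_update(behavior, advantages, temperature=0.5)
print("tight:", tight, "KL:", kl_divergence(tight, behavior))
print("loose:", loose, "KL:", kl_divergence(loose, behavior))
```

The temperature plays the role of the KL penalty weight: raising it trades improvement for staying near the data distribution.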
If this is right
- Surpasses pure online RL in online sample efficiency
- Guarantees monotonic improvement under coverage assumptions
- Reduces online environment interactions compared to state-of-the-art hybrids
- Improves final returns on D4RL benchmarks
- Remains robust across diverse offline algorithms and online optimizers
Where Pith is reading between the lines
- This cyclic approach may help in developing more stable continual learning systems in RL.
- The cycle frequency could be tuned to better balance the offline and online phases.
- Applications in real-world robotics might benefit from reduced interaction needs with physical environments.
Load-bearing premise
That periodically returning to offline training on the dataset reliably eliminates distributional drift and catastrophic forgetting.
What would settle it
A test showing that COOPO does not reduce online interactions or fails to maintain monotonic improvement even when coverage assumptions are met.
Original abstract
Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.
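The alternation schedule described in the abstract can be sketched as a small driver loop. This is a sketch under stated assumptions, not the authors' implementation: `run_cycles`, `exponentiated_step`, the three-armed bandit, and all numeric values are hypothetical, and both phases use a simple multiplicative-weights update as a placeholder for the paper's offline and online optimizers.

```python
import numpy as np

def exponentiated_step(policy, reward_estimate, step_size):
    """Placeholder update: pi'(a) proportional to pi(a) * exp(step_size * r(a))."""
    logits = np.log(policy) + step_size * reward_estimate
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def run_cycles(policy, offline_phase, online_phase, evaluate, num_cycles):
    """Alternate an offline phase and an online phase, logging the return after each."""
    history = [("init", evaluate(policy))]
    for cycle in range(num_cycles):
        policy = offline_phase(policy)
        history.append((f"cycle {cycle} offline", evaluate(policy)))
        policy = online_phase(policy)
        history.append((f"cycle {cycle} online", evaluate(policy)))
    return policy, history

# Hypothetical three-armed bandit standing in for the environment.
true_rewards = np.array([0.1, 0.5, 0.9])
# Hypothetical reward estimates from a fixed dataset: biased, but with the same ranking.
dataset_rewards = np.array([0.2, 0.4, 0.6])
initial_policy = np.array([0.6, 0.3, 0.1])

final_policy, history = run_cycles(
    initial_policy,
    # Offline phase: only ever sees the dataset estimates.
    offline_phase=lambda p: exponentiated_step(p, dataset_rewards, step_size=1.0),
    # Online phase: uses the true rewards as a stand-in for fresh interaction data.
    online_phase=lambda p: exponentiated_step(p, true_rewards, step_size=0.5),
    evaluate=lambda p: float(p @ true_rewards),
    num_cycles=3,
)
returns = [value for _, value in history]
for label, value in history:
    print(f"{label}: {value:.3f}")
```

In this toy the logged returns never decrease because the dataset estimates rank the arms correctly. That is a property of the construction, not evidence for the paper's guarantee.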
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces COOPO, a cyclic offline-online RL framework that alternates between KL-regularized advantage-weighted offline updates (to anchor the policy and minimize distributional shift) and online fine-tuning with arbitrary policy optimizers. It claims superior online sample efficiency over pure online RL, guaranteed monotonic improvement across cycles under standard coverage assumptions, and empirical gains on D4RL benchmarks including reduced online interactions and higher returns while remaining robust to choice of offline and online components.
Significance. If the monotonic improvement guarantee holds and the cyclic structure demonstrably preserves coverage while reducing online samples, the work would advance hybrid RL by providing a general, reusable framework that mitigates drift and forgetting; the reported robustness across algorithm choices would be a practical strength.
major comments (1)
- [Section 3] Section 3 (Theoretical Analysis), monotonic improvement claim: the guarantee of monotonic improvement across the full cyclic trajectory rests on the coverage assumption (e.g., concentrability or single-policy coverage w.r.t. the fixed offline dataset) remaining valid after each online fine-tuning phase so that the subsequent KL-regularized offline update can again deliver a non-negative improvement bound. The manuscript invokes the assumption for each offline step but does not derive or bound how the online optimizer affects the coverage coefficient; if the online phase increases support mismatch, the next offline improvement bound can degrade or become negative, which is load-bearing for the central theoretical claim of monotonicity over multiple cycles.
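The referee's concern can be illustrated numerically in a tabular setting. The sketch below assumes one common definition of a single-policy concentrability coefficient, the largest ratio between the policy's state-action occupancy and the dataset distribution. The function name and the occupancy vectors are hypothetical, and the paper may use a different coverage notion.

```python
import numpy as np

def concentrability(policy_occupancy, data_distribution):
    """Largest ratio d_pi(s, a) / mu(s, a) over state-action pairs the policy visits."""
    d = np.asarray(policy_occupancy, dtype=float)
    mu = np.asarray(data_distribution, dtype=float)
    # If the policy visits a pair the dataset never covers, coverage fails outright.
    if np.any((mu == 0) & (d > 0)):
        return float("inf")
    visited = d > 0
    return float(np.max(d[visited] / mu[visited]))

# Hypothetical flattened state-action distributions.
dataset = np.array([0.4, 0.4, 0.2])
after_offline = np.array([0.5, 0.3, 0.2])  # policy anchored near the data
after_online = np.array([0.1, 0.2, 0.7])   # policy after drifting during fine-tuning

print("after offline phase:", concentrability(after_offline, dataset))
print("after online phase:", concentrability(after_online, dataset))
print("uncovered pair:", concentrability([0.0, 0.5, 0.5], [0.5, 0.5, 0.0]))
```

If online fine-tuning moves occupancy toward poorly covered pairs, the coefficient grows, and any offline bound that scales with it loosens accordingly.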
minor comments (1)
- [Abstract] Abstract: the phrase 'standard coverage assumptions' is used without specifying the precise form (e.g., concentrability coefficient bound or single-policy concentrability) employed in the proofs; adding this would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript. The major comment on the theoretical analysis in Section 3 raises an important point about sustaining the coverage assumption across cycles, which we address directly below. We believe incorporating the requested clarification will strengthen the presentation of the monotonic improvement result.
Referee: [Section 3] Section 3 (Theoretical Analysis), monotonic improvement claim: the guarantee of monotonic improvement across the full cyclic trajectory rests on the coverage assumption (e.g., concentrability or single-policy coverage w.r.t. the fixed offline dataset) remaining valid after each online fine-tuning phase so that the subsequent KL-regularized offline update can again deliver a non-negative improvement bound. The manuscript invokes the assumption for each offline step but does not derive or bound how the online optimizer affects the coverage coefficient; if the online phase increases support mismatch, the next offline improvement bound can degrade or become negative, which is load-bearing for the central theoretical claim of monotonicity over multiple cycles.
Authors: We appreciate the referee's careful reading and agree that an explicit treatment of how the coverage coefficient evolves after the online phase is necessary to rigorously close the monotonicity argument over multiple cycles. The current manuscript states the standard coverage assumption (concentrability or single-policy coverage with respect to the fixed offline dataset) at the beginning of each offline update but does not derive a bound on its possible growth during online fine-tuning. In the revised manuscript we will add a short supporting lemma in Section 3 that bounds the change in the coverage coefficient after a finite number of online steps. Under the mild additional assumption that the online optimizer employs a trust-region constraint (KL divergence between consecutive policies bounded by a small constant), the concentrability coefficient can increase by at most a multiplicative factor that depends on the trust-region radius and the number of online steps. With this bound in hand, the improvement delivered by the subsequent KL-regularized offline step remains non-negative, thereby preserving the overall monotonicity guarantee across cycles. We will also add a brief remark clarifying that the cyclic structure itself helps control drift because each offline anchoring step resets the policy toward the dataset support.
Revision: yes
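A short derivation shows why both the trust-region radius and the number of online steps would naturally appear in such a lemma. This is not the authors' lemma: it is a standard consequence of Pinsker's inequality and the triangle inequality. The symbols \delta (per-step KL radius), n (number of online steps), and \pi_k (policy after step k) are assumptions introduced here.

```latex
% Pinsker's inequality, for any distributions P and Q:
\[
  \mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)} .
\]
% Assume every online step satisfies
% KL(\pi_{k+1}(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)) \le \delta for all states s.
% The triangle inequality for total variation then gives, after n steps:
\[
  \mathrm{TV}\bigl(\pi_n(\cdot \mid s), \pi_0(\cdot \mid s)\bigr)
  \le \sum_{k=0}^{n-1} \mathrm{TV}\bigl(\pi_{k+1}(\cdot \mid s), \pi_k(\cdot \mid s)\bigr)
  \le n \sqrt{\delta / 2} .
\]
```

This bounds per-state policy drift only. Turning it into a bound on a density-ratio quantity such as the concentrability coefficient requires further assumptions, which is the step the promised lemma would have to supply.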
Circularity Check
No circularity: theoretical claims invoke external standard assumptions without reduction to inputs
The paper's central theoretical claim of guaranteed monotonic improvement and better sample efficiency is explicitly conditioned on 'standard coverage assumptions' from the broader RL literature. These assumptions are treated as given external inputs rather than derived from or redefined in terms of the COOPO cyclic structure. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description that would reduce the claimed guarantees to the method's own outputs by construction. The cyclic offline-online framework is presented as an independent algorithmic contribution that leverages (but does not presuppose) those assumptions, with empirical validation on D4RL benchmarks providing separate support. This qualifies as a self-contained derivation against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: standard coverage assumptions