
arxiv: 2605.18675 · v1 · pith:PEULEP3Jnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

Pith reviewed 2026-05-20 13:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · offline RL · online RL · hybrid methods · sample efficiency · distributional shift · policy optimization

The pith

COOPO cycles between offline and online phases and claims better online sample efficiency than pure online RL, with monotonic improvement guaranteed under standard coverage assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents COOPO as a framework for hybrid reinforcement learning that repeatedly cycles between offline and online training. Each cycle begins with constrained offline updates using KL regularization to anchor the policy to the dataset, followed by online fine-tuning. This design is intended to prevent distributional shift and catastrophic forgetting while maximizing the use of offline data and minimizing online interactions. A sympathetic reader would care because it promises more practical RL by lowering the number of costly environment interactions required for high performance.
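The cycle described above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `offline_phase` and `online_phase` are hypothetical callables standing in for the KL-regularized offline update and the online fine-tuning step, and the toy policy is just an integer so the control flow is easy to follow.

```python
from typing import Callable, List, Tuple, TypeVar

P = TypeVar("P")  # whatever object represents the policy


def run_cyclic_training(
    policy: P,
    offline_phase: Callable[[P], P],
    online_phase: Callable[[P], Tuple[P, int]],
    num_cycles: int,
) -> Tuple[P, List[int]]:
    """Alternate offline anchoring and online fine-tuning.

    offline_phase (hypothetical): updates the policy on the fixed dataset.
    online_phase (hypothetical): fine-tunes the policy and returns
        (new_policy, number_of_environment_steps_used).
    Returns the final policy and the cumulative online step count
    recorded at the end of each cycle.
    """
    cumulative_steps: List[int] = []
    total = 0
    for _ in range(num_cycles):
        policy = offline_phase(policy)       # anchor to the dataset first
        policy, used = online_phase(policy)  # then fine-tune online
        total += used
        cumulative_steps.append(total)
    return policy, cumulative_steps


# Toy stand-ins so the sketch runs end to end (not real RL updates).
trace: List[str] = []


def toy_offline(p: int) -> int:
    trace.append("offline")
    return p + 1


def toy_online(p: int) -> Tuple[int, int]:
    trace.append("online")
    return p + 10, 100  # pretend 100 environment steps were consumed


final_policy, interactions = run_cyclic_training(
    0, toy_offline, toy_online, num_cycles=3
)
```

Tracking the cumulative online step count per cycle is a design choice made here because online interactions are the cost the paper aims to reduce; a real implementation would also need stopping criteria for each phase, which the summary does not specify.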

Core claim

COOPO cycles between constrained offline training and online fine-tuning, where the offline phase uses KL-regularized advantage-weighted updates intended to eliminate drift and forgetting. The paper claims that this yields better online sample efficiency than pure online RL, with monotonic improvement guaranteed under standard coverage assumptions.

What carries the argument

The cyclic structure, which alternates between KL-regularized advantage-weighted offline policy updates (to minimize distributional shift) and online policy optimization (for stable exploration).
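To make "KL-regularized advantage-weighted" concrete, here is a minimal sketch of the standard closed form used by advantage-weighted methods: for a single state, maximizing the expected advantage minus `lam` times KL(pi || behavior) gives pi(a) proportional to behavior(a) * exp(A(a) / lam). The exact objective in the paper is not shown in this summary, so treat the function name, the temperature `lam`, and the toy numbers as assumptions for illustration only.

```python
import math
from typing import List, Sequence


def kl_regularized_policy(
    behavior_probs: Sequence[float],
    advantages: Sequence[float],
    lam: float,
) -> List[float]:
    """Closed-form maximizer of  sum_a pi(a) * A(a) - lam * KL(pi || behavior)
    over the action distribution at one state.

    Assumes every behavior probability is strictly positive and lam > 0.
    """
    logits = [math.log(p) + a / lam for p, a in zip(behavior_probs, advantages)]
    m = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(l - m) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]


behavior = [0.5, 0.3, 0.2]    # toy dataset action frequencies
advantage = [0.0, 1.0, -1.0]  # toy advantage estimates

moderate = kl_regularized_policy(behavior, advantage, lam=1.0)
anchored = kl_regularized_policy(behavior, advantage, lam=1e6)
```

A large `lam` keeps the policy almost identical to the behavior distribution (strong anchoring), while a small `lam` shifts mass toward high-advantage actions. In practice the same exponential weights are typically applied to a log-likelihood loss over dataset samples rather than computed in closed form.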

If this is right

  • Surpasses pure online RL in online sample efficiency
  • Guarantees monotonic improvement under coverage assumptions
  • Reduces online environment interactions compared to state-of-the-art hybrids
  • Improves final returns on D4RL benchmarks
  • Remains robust across diverse offline algorithms and online optimizers

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This cyclic approach may help in developing more stable continual learning systems in RL.
  • The cycle frequency could be tuned to find the best balance between offline and online phases.
  • Applications in real-world robotics might benefit from reduced interaction needs with physical environments.

Load-bearing premise

That periodically returning to offline training on the dataset reliably eliminates distributional drift and catastrophic forgetting.

What would settle it

A test showing that COOPO does not reduce online interactions or fails to maintain monotonic improvement even when coverage assumptions are met.
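Such a test could be scripted along the following lines. This is a hedged sketch: the helper names, the tolerance `tol` (evaluation returns are noisy, so some slack is usually needed), and the curves are all hypothetical, and the numbers are synthetic placeholders rather than results from the paper.

```python
from typing import Optional, Sequence


def first_monotonicity_violation(
    cycle_returns: Sequence[float], tol: float = 0.0
) -> Optional[int]:
    """Index of the first cycle whose return drops by more than tol
    relative to the previous cycle, or None if there is no such cycle."""
    for k in range(1, len(cycle_returns)):
        if cycle_returns[k] < cycle_returns[k - 1] - tol:
            return k
    return None


def online_steps_to_reach(
    online_steps: Sequence[int], returns: Sequence[float], target: float
) -> Optional[int]:
    """Smallest logged online step count at which the return reaches target,
    or None if the target is never reached."""
    for steps, ret in zip(online_steps, returns):
        if ret >= target:
            return steps
    return None


# Synthetic placeholder curves (not taken from the paper).
steps = [0, 1000, 2000, 3000]
curve_a = [10.0, 40.0, 70.0, 90.0]
curve_b = [10.0, 20.0, 45.0, 75.0]

violation = first_monotonicity_violation(curve_a)
steps_a = online_steps_to_reach(steps, curve_a, target=70.0)
steps_b = online_steps_to_reach(steps, curve_b, target=70.0)
```

A non-`None` result from the first helper on per-cycle evaluation returns, or a method under test needing more online steps than its baseline to reach the same target, would count as evidence against the corresponding claim.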

Figures

Figures reproduced from arXiv: 2605.18675 by Aditya Balu, Cody Fleming, Joshua Russell Waite, Qisai Liu, Soumik Sarkar, Zhanhong Jiang.

Figure 1: Cyclic offline-online policy optimization: Traditional …
Figure 2: Schematic Diagram of COOPO: COOPO cyclically …
Figure 3: Mujoco environments used in this work for evaluation: …
Figure 4: Average return vs. gradient steps on HalfCheetah for …
Figure 6: Training performance across different online-offline …
Figure 7: Training performance against trajectory number between …
Figure 8: Performance comparison in the HalfCheetah environment for different λ values, averaged over multiple seeds
Original abstract

Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces COOPO, a cyclic offline-online RL framework that alternates between KL-regularized advantage-weighted offline updates (to anchor the policy and minimize distributional shift) and online fine-tuning with arbitrary policy optimizers. It claims superior online sample efficiency over pure online RL, guaranteed monotonic improvement across cycles under standard coverage assumptions, and empirical gains on D4RL benchmarks including reduced online interactions and higher returns while remaining robust to choice of offline and online components.

Significance. If the monotonic improvement guarantee holds and the cyclic structure demonstrably preserves coverage while reducing online samples, the work would advance hybrid RL by providing a general, reusable framework that mitigates drift and forgetting; the reported robustness across algorithm choices would be a practical strength.

major comments (1)
  1. [Section 3] Section 3 (Theoretical Analysis), monotonic improvement claim: the guarantee of monotonic improvement across the full cyclic trajectory rests on the coverage assumption (e.g., concentrability or single-policy coverage w.r.t. the fixed offline dataset) remaining valid after each online fine-tuning phase so that the subsequent KL-regularized offline update can again deliver a non-negative improvement bound. The manuscript invokes the assumption for each offline step but does not derive or bound how the online optimizer affects the coverage coefficient; if the online phase increases support mismatch, the next offline improvement bound can degrade or become negative, which is load-bearing for the central theoretical claim of monotonicity over multiple cycles.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'standard coverage assumptions' is used without specifying the precise form (e.g., concentrability coefficient bound or single-policy concentrability) employed in the proofs; adding this would improve clarity.
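The referee's concern about coverage can be illustrated with the usual tabular definition of a concentrability coefficient, the worst-case ratio between the policy's state-action occupancy and the dataset distribution. The sketch below is an assumption-laden illustration: the manuscript's precise coverage condition is not given in this summary, and the distributions are toy values.

```python
import math
from typing import Sequence


def concentrability(d_pi: Sequence[float], mu: Sequence[float]) -> float:
    """max over (s, a) of d_pi(s, a) / mu(s, a) for tabular distributions.

    Returns math.inf when the policy visits a pair the dataset never covers.
    """
    worst = 0.0
    for p, q in zip(d_pi, mu):
        if p == 0.0:
            continue  # pairs the policy never visits do not matter
        if q == 0.0:
            return math.inf  # support mismatch: no finite coverage
        worst = max(worst, p / q)
    return worst


mu = [0.5, 0.3, 0.2, 0.0]            # toy dataset distribution over 4 pairs
d_anchored = [0.4, 0.4, 0.2, 0.0]    # occupancy that stays near the data
d_drifted = [0.1, 0.2, 0.7, 0.0]     # occupancy concentrated on a rare pair
d_off_support = [0.1, 0.2, 0.6, 0.1] # occupancy that leaves the data support

c_anchored = concentrability(d_anchored, mu)
c_drifted = concentrability(d_drifted, mu)
c_off_support = concentrability(d_off_support, mu)
```

The three cases show why the point is load-bearing: if online fine-tuning moves the occupancy toward rarely or never observed pairs, the coefficient grows or becomes infinite, and any improvement bound that scales with it weakens accordingly.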

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. The major comment on the theoretical analysis in Section 3 raises an important point about sustaining the coverage assumption across cycles, which we address directly below. We believe incorporating the requested clarification will strengthen the presentation of the monotonic improvement result.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Theoretical Analysis), monotonic improvement claim: the guarantee of monotonic improvement across the full cyclic trajectory rests on the coverage assumption (e.g., concentrability or single-policy coverage w.r.t. the fixed offline dataset) remaining valid after each online fine-tuning phase so that the subsequent KL-regularized offline update can again deliver a non-negative improvement bound. The manuscript invokes the assumption for each offline step but does not derive or bound how the online optimizer affects the coverage coefficient; if the online phase increases support mismatch, the next offline improvement bound can degrade or become negative, which is load-bearing for the central theoretical claim of monotonicity over multiple cycles.

    Authors: We appreciate the referee's careful reading and agree that an explicit treatment of how the coverage coefficient evolves after the online phase is necessary to rigorously close the monotonicity argument over multiple cycles. The current manuscript states the standard coverage assumption (concentrability or single-policy coverage with respect to the fixed offline dataset) at the beginning of each offline update but does not derive a bound on its possible growth during online fine-tuning. In the revised manuscript we will add a short supporting lemma in Section 3 that bounds the change in the coverage coefficient after a finite number of online steps. Under the mild additional assumption that the online optimizer employs a trust-region constraint (KL divergence between consecutive policies bounded by a small constant), the concentrability coefficient can increase by at most a multiplicative factor that depends on the trust-region radius and the number of online steps. With this bound in hand, the improvement delivered by the subsequent KL-regularized offline step remains non-negative, thereby preserving the overall monotonicity guarantee across cycles. We will also add a brief remark clarifying that the cyclic structure itself helps control drift because each offline anchoring step resets the policy toward the dataset support. revision: yes
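The dependence on trust-region radius and number of online steps can be illustrated with a standard fact rather than the promised lemma itself. By Pinsker's inequality, a per-step KL bound of `delta` limits the per-step total variation distance to sqrt(delta / 2), and the triangle inequality then limits the drift after T steps to T * sqrt(delta / 2). The sketch below checks this numerically at a single state; turning it into a bound on the concentrability coefficient requires further argument that is not shown here, and the policy sequence is a toy assumption.

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def tv(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def drift_bound(delta: float, num_steps: int) -> float:
    """Upper bound on TV(pi_T, pi_0) when each step satisfies KL <= delta."""
    return min(1.0, num_steps * math.sqrt(delta / 2.0))


# Toy sequence of action distributions at one state across online updates.
policies: List[List[float]] = [
    [0.50, 0.30, 0.20],
    [0.45, 0.33, 0.22],
    [0.40, 0.36, 0.24],
    [0.35, 0.39, 0.26],
]

step_kls = [kl(policies[t + 1], policies[t]) for t in range(len(policies) - 1)]
step_tvs = [tv(policies[t + 1], policies[t]) for t in range(len(policies) - 1)]
delta = max(step_kls)  # smallest radius that all steps satisfy
total_drift = tv(policies[-1], policies[0])
bound = drift_bound(delta, len(policies) - 1)
```

The bound grows linearly in the number of steps, which is consistent with the rebuttal's remark that periodic offline anchoring helps by resetting the policy before drift accumulates.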

Circularity Check

0 steps flagged

No circularity: theoretical claims invoke external standard assumptions without reduction to inputs

full rationale

The paper's central theoretical claim of guaranteed monotonic improvement and better sample efficiency is explicitly conditioned on 'standard coverage assumptions' from the broader RL literature. These assumptions are treated as given external inputs rather than derived from or redefined in terms of the COOPO cyclic structure. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description that would reduce the claimed guarantees to the method's own outputs by construction. The cyclic offline-online framework is presented as an independent algorithmic contribution that leverages (but does not presuppose) those assumptions, with empirical validation on D4RL benchmarks providing separate support. This qualifies as a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest primarily on standard RL coverage assumptions for the theoretical guarantees and on the effectiveness of the described cyclic structure; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption standard coverage assumptions
    Invoked to support the guaranteed monotonic improvement and better sample efficiency claims.


