arxiv: 2606.05559 · v1 · pith:NBFXUUD4new · submitted 2026-06-04 · 💻 cs.LG

CLaaS: Continual learning as a service for sample efficient online learning

Pith reviewed 2026-06-28 02:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learning · online learning · experience replay · language model agents · forward transfer · catastrophic forgetting · sample efficiency

The pith

CLaaS lets language-model agents adapt online from single-use rollouts by storing them in a replay buffer for gradient reuse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how deployed agents must handle distribution shifts in environments that cannot be reset, so each scenario yields only one rollout. It introduces CLaaS, a service that keeps an experience replay buffer behind a chat API and performs asynchronous parametric updates on the stored data. On an adversarial streaming task the authors find that these updates produce better forward transfer to new scenarios and less forgetting of earlier ones than in-context learning alone, and that the replay step is required to reach the reported sample efficiency.

Core claim

In an experiential online continual learning setting where each scenario is encountered only once, storing rollouts in an experience replay buffer enables asynchronous gradient-based training that yields superior forward transfer and reduced forgetting relative to in-context learning.

What carries the argument

An experience replay buffer that stores single-use agent rollouts so gradients can be reused during asynchronous training.
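As a rough illustration of the idea, the sketch below stores single-use rollouts and lets a trainer sample them repeatedly. It is not the authors' implementation: the `Rollout` fields, the `ReplayBuffer` class, and the `max_age` knob are hypothetical names, and the age-based eviction rule is only assumed from the figure captions' mention of replay eviction ages.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    """One single-use episode. Field names are hypothetical."""
    prompt: str
    response: str
    reward: float
    policy_version: int  # training step of the policy that produced it


class ReplayBuffer:
    """FIFO store with age-based eviction (max_age is an assumed knob)."""

    def __init__(self, max_age: int, seed: int = 0):
        self.max_age = max_age
        self._items = deque()
        self._rng = random.Random(seed)

    def add(self, rollout: Rollout) -> None:
        self._items.append(rollout)

    def evict_stale(self, current_version: int) -> None:
        # Assumes rollouts arrive in policy-version order, so the oldest
        # records always sit at the left end of the deque.
        while (self._items and
               current_version - self._items[0].policy_version > self.max_age):
            self._items.popleft()

    def sample(self, batch_size: int) -> list:
        # Records stay in the buffer after sampling, so one rollout can
        # contribute to several gradient steps (the "gradient reuse").
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


# Usage: one new rollout per step, sampled alongside older ones.
buffer = ReplayBuffer(max_age=2)
for step in range(5):
    buffer.add(Rollout(f"scenario-{step}", "reply", 1.0, policy_version=step))
    buffer.evict_stale(current_version=step)
    batch = buffer.sample(batch_size=2)
```

A larger `max_age` gives each rollout more reuse but makes the sampled data staler relative to the current policy, which is the trade-off the load-bearing premise below depends on.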

If this is right

  • Parametric updates produce higher forward transfer to future tasks than in-context learning.
  • Parametric updates produce lower forgetting of prior tasks than in-context learning.
  • The replay buffer is required to reach the observed sample efficiency.
  • The mechanism can be hidden behind a standard chat API without changing the agent interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same replay-plus-asynchronous-update pattern could be applied in other non-resettable domains such as physical robotics.
  • Long-running deployments might accumulate enough replay data to support periodic model compression or distillation steps.
  • The separation of rollout collection from training opens the possibility of scaling the service to many agents sharing one buffer.

Load-bearing premise

The replay buffer can be filled and sampled asynchronously from single-use rollouts without introducing bias or instability that cancels the reported gains in transfer and retention.

What would settle it

A controlled run on the same adversarial task in which replay is disabled yet forward transfer and forgetting metrics remain at least as good as the replay-enabled condition.
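A minimal sketch of how such a comparison might be scored, assuming a checkpoint-by-split matrix in which cell (k, j) is the success rate of checkpoint k on split j and checkpoint k has trained on splits up to k. The metric definitions follow common continual-learning conventions and may differ from the paper's, the function names are hypothetical, and the numbers are invented for illustration only.

```python
def forward_transfer(matrix):
    """Mean score on splits a checkpoint has not trained on yet (j > k)."""
    cells = [matrix[k][j]
             for k in range(len(matrix))
             for j in range(k + 1, len(matrix[k]))]
    return sum(cells) / len(cells)


def forgetting(matrix):
    """Mean drop from the best earlier score to the final checkpoint."""
    last = len(matrix) - 1
    drops = []
    for j in range(last):  # the final split cannot have been forgotten yet
        best_earlier = max(matrix[k][j] for k in range(j, last))
        drops.append(best_earlier - matrix[last][j])
    return sum(drops) / len(drops)


def replay_claim_refuted(with_replay, without_replay):
    """True if disabling replay is at least as good on both metrics."""
    return (forward_transfer(without_replay) >= forward_transfer(with_replay)
            and forgetting(without_replay) <= forgetting(with_replay))


# Invented matrices (success rate in %), NOT results from the paper.
with_replay = [[80, 60, 55],
               [85, 82, 65],
               [84, 80, 81]]
without_replay = [[70, 50, 45],
                  [60, 72, 55],
                  [50, 60, 71]]

print(forward_transfer(with_replay), forgetting(with_replay))
print(forward_transfer(without_replay), forgetting(without_replay))
print(replay_claim_refuted(with_replay, without_replay))
```

In practice the comparison would also need several trials per condition, since a single run cannot separate a real difference from seed noise.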

Figures

Figures reproduced from arXiv: 2606.05559 by Barak Widawsky, Kion Fallah, Qingqing Mao, Silen Naihin.

Figure 1: CLaaS enables experiential learning of agents from a stream of rollouts gathered during deployment. Gradient reuse from a replay buffer improves sample efficiency during async training, leading to faster generalization with less data. improvements that generalize to future tasks while avoiding degradation on previous tasks. Parametric updates via on-policy reinforcement learning (RL) have been effective fo…
Figure 2: CLaaS: continual learning as-a-service for online policy improvements via async training. Given any user agent harness that uses a chat API, CLaaS collects live rollouts in an experience replay buffer B. The training engine improves the policy by sampling batches, in addition to rewards gathered from the environment, for gradient updates to a LoRA that gets hot-reloaded into the inference server. The contr…
Figure 3: Average defender success rate across splits at every checkpoint under different replay eviction ages with REINFORCE++. Performance improves until A_max = 25 before destabilizing at larger replay ages. 4. Experiments 4.1. Experimental Setup Experiments are conducted using CLaaS with a sequence of scenarios that induce adversarial distribution shift. The policy is assigned reward for responses which comply…
Figure 4: Walkthrough of one adaptive IH-Challenge scenario. The attacker receives ICL feedback after each turn and adapts its strategy. The defender’s responses are graded by a per-scenario Python verifier. Successful defenses produce r = 1 training signals; failures produce r = 0 with corrective feedback for SDPO reprompting. • Score (defender pass rate): SDPO converges to higher scores faster and maintains them; …
Figure 5: Checkpoint-split transfer matrices for each method, averaged over 9 trials. Each cell shows defender success rate (%) when checkpoint k (row) is evaluated on split j (column). SDPO achieves uniformly high performance across the matrix, indicating both strong forward transfer and minimal forgetting.
Figure 6: Training dynamics comparison across methods. Left: defender score over training steps. Center: policy entropy. Right: policy gradient objective. Thin lines are raw values; bold lines are smoothed (window=10). SDPO maintains higher entropy and lower PG loss variance throughout training. C.2. Method-Specific Configurations Each method uses the shared configuration above and overrides algorithm-specific hyper…
Figure 7: Left: timing breakdown per training step for each method. Right: replay buffer size over training steps, showing the steady-state fill level with async collection and age-based eviction. Periodic drops in SDPO buffer size correspond to split boundaries where stale records are evicted.
Figure 8: SDPO teacher reprompt templates used on IH-Challenge. <SOLUTION> is replaced by the student’s prior visible response (think blocks stripped). The template produces a conditional prompt for the EMA teacher to generate an improved response.
Original abstract

Deployed large language model agents must adapt to distribution shift in dynamic environments. Ideally, adaptation can be performed from accumulated agent experiences and retain prior capabilities while transferring to future tasks. However, agent actions and environmental transitions can only be sampled once per scenario, as real-world environments cannot be trivially reset. To this end, we investigate an experiential and online continual learning setting in which agents learn from a stream of scenarios. We propose continual learning as-a-service (CLaaS), a system which enables agents to improve during deployment, abstracted behind a chat API. To increase sample efficiency, CLaaS stores rollouts in an experience replay buffer for gradient reuse during asynchronous training. We evaluate CLaaS on an adversarial task, demonstrating that parametric updates lead to superior forward transfer and less forgetting than in-context learning, with replay being a critical choice for sample efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CLaaS, a continual learning as-a-service system for LLM agents adapting to distribution shift in non-resettable environments. Agents accumulate experiences from a stream of scenarios; rollouts are stored in an experience replay buffer to enable asynchronous gradient-based parametric updates. The central empirical claim is that these parametric updates yield superior forward transfer and less forgetting than in-context learning, with replay being critical for sample efficiency, as demonstrated on an adversarial task.

Significance. If the results hold with appropriate controls, the work addresses a practically important setting for deployed agents where environments cannot be reset and experiences are single-use. The emphasis on replay for gradient reuse in an online continual-learning regime is a concrete contribution to sample-efficient adaptation while retaining prior capabilities.

major comments (2)
  1. [Abstract] The claim that parametric updates lead to superior forward transfer and less forgetting is presented without any quantitative metrics, baselines, error bars, dataset details, or exclusion criteria, so it is not possible to assess whether the data support the superiority statement.
  2. [CLaaS system description] The replay buffer is populated from single-use rollouts under a changing policy and sampled asynchronously for gradient reuse, yet no off-policy correction (importance sampling, prioritized replay, or explicit bias analysis) is described; this directly threatens the sample-efficiency and stability claims that underpin the reported gains in transfer and retention.
minor comments (1)
  1. [Abstract] The phrase 'adversarial task' is used without reference to a concrete benchmark, environment, or task definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claim that parametric updates lead to superior forward transfer and less forgetting is presented without any quantitative metrics, baselines, error bars, dataset details, or exclusion criteria, so it is not possible to assess whether the data support the superiority statement.

    Authors: We agree that the abstract lacks quantitative support for the claims. In the revised manuscript we will update the abstract to include specific metrics on forward transfer and retention, the in-context learning baseline, error bars, and details on the adversarial task and evaluation criteria. revision: yes

  2. Referee: [CLaaS system description] The replay buffer is populated from single-use rollouts under a changing policy and sampled asynchronously for gradient reuse, yet no off-policy correction (importance sampling, prioritized replay, or explicit bias analysis) is described; this directly threatens the sample-efficiency and stability claims that underpin the reported gains in transfer and retention.

    Authors: We acknowledge that the manuscript does not describe off-policy corrections. We will revise the CLaaS system description to include an explicit discussion of policy-induced bias together with either an importance-sampling correction or a quantitative bias analysis, so that the sample-efficiency and stability claims rest on firmer methodological ground. revision: yes
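For readers unfamiliar with the correction under discussion, here is a generic sketch of truncated importance weighting, plus an effective-sample-size diagnostic for how stale a replayed batch is. It is not taken from the paper and does not describe what the authors will implement: the function names, the `clip` value, and the example log-probabilities are all assumptions.

```python
import math


def truncated_is_weights(logp_current, logp_behavior, clip=2.0):
    """Per-rollout weight min(pi_current / pi_behavior, clip).

    logp_behavior is the sequence log-probability recorded when the rollout
    was collected; logp_current is recomputed under the policy being trained.
    """
    return [min(math.exp(c - b), clip)
            for c, b in zip(logp_current, logp_behavior)]


def effective_sample_size(weights):
    """Kish effective sample size: equals len(weights) when all are equal."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Made-up log-probabilities for three replayed rollouts.
logp_current = [-1.0, -2.0, -0.5]
logp_behavior = [-1.0, -1.0, -2.0]

weights = truncated_is_weights(logp_current, logp_behavior)
ess = effective_sample_size(weights)
print(weights, ess)
```

In a policy-gradient update each weight would multiply the corresponding rollout's loss term and be treated as a constant. Clipping bounds the variance at the cost of some bias, and an effective sample size far below the batch size suggests the buffer has drifted away from the current policy.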

Circularity Check

0 steps flagged

No circularity: empirical system comparison with no derivation chain

Full rationale

The paper describes an empirical evaluation of CLaaS, a continual learning system using experience replay for online adaptation of LLM agents. The central claims concern experimental outcomes on forward transfer and forgetting when comparing parametric updates (with replay) against in-context learning. No mathematical derivations, fitted parameters renamed as predictions, self-citations as load-bearing uniqueness theorems, or ansatzes are present in the provided text. The replay buffer usage is presented as an implementation choice whose benefits are measured experimentally rather than derived by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that each real-world scenario permits only a single rollout, and that standard replay techniques transfer directly to LLM agents without new instabilities.

axioms (1)
  • Domain assumption: Real-world environments cannot be trivially reset, so agent actions and transitions can be sampled only once per scenario.
    Explicitly stated in the abstract as the motivation for the experiential online setting.

pith-pipeline@v0.9.1-grok · 5687 in / 1123 out tokens · 47393 ms · 2026-06-28T02:46:29.727078+00:00 · methodology
