
arXiv: 2601.03938 · v2 · submitted 2026-01-07 · 💻 cs.LG · cs.AI · cs.CL

FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning

Pith reviewed 2026-05-16 16:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords: continual learning, catastrophic forgetting, memory replay, forgetting curve, large language models, optimizer updates, Ebbinghaus forgetting

The pith

FOREVER schedules memory replays for LLMs using a forgetting curve timed by optimizer update magnitude instead of training steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a continual learning approach for large language models that replays selected past examples at intervals drawn from the Ebbinghaus forgetting curve. It replaces fixed step counts with a model-centric clock based on the size of parameter updates during optimization. This change lets replay timing and intensity track how much the model has actually changed rather than how many batches have passed. Experiments on three benchmarks with models from 0.6B to 13B parameters show reduced forgetting while new tasks are learned. A reader would care because it offers a direct way to keep earlier knowledge alive when an LLM must absorb new data in sequence.

Core claim

FOREVER defines model time from the magnitude of optimizer updates, applies a forgetting-curve scheduler to decide when to replay and how intensely, and adds intensity-aware regularization; the result is consistent mitigation of catastrophic forgetting on three continual-learning benchmarks across models ranging from 0.6B to 13B parameters.

What carries the argument

The forgetting curve-based replay scheduler that treats optimizer update magnitude as model time to set replay intervals and regularization strength.
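
A minimal sketch of how such a clock could work, not the authors' implementation. It assumes that model time is the running sum of per-step update norms, that one "model day" is a fixed amount of accumulated norm (`day_size`, a hypothetical parameter), and that replay fires at spaced-repetition style day marks (`intervals`, hypothetical values). None of these names or constants come from the paper.

```python
import math


def update_norm(old_params, new_params):
    """L2 norm of the parameter change produced by one optimizer step."""
    return math.sqrt(sum((n - o) ** 2 for o, n in zip(old_params, new_params)))


class ModelClock:
    """Accumulates update magnitude and signals replay at model-day marks."""

    def __init__(self, day_size, intervals=(1, 2, 4, 7, 15)):
        self.day_size = day_size          # assumed: norm budget per "model day"
        self.pending = sorted(intervals)  # assumed: replay marks, in model days
        self.tau = 0.0                    # accumulated model time

    def tick(self, delta):
        """Advance by one step's update norm; return True if a replay is due."""
        self.tau += delta
        days = self.tau / self.day_size
        due = False
        # Consume every mark that the accumulated model time has passed.
        while self.pending and days >= self.pending[0]:
            self.pending.pop(0)
            due = True
        return due


# Toy run: large early updates, a quiet stretch, then one large update.
clock = ModelClock(day_size=1.0, intervals=(1, 2, 4))
deltas = [0.6, 0.6, 0.9, 0.1, 0.1, 0.1, 1.8]
replay_steps = [i for i, d in enumerate(deltas) if clock.tick(d)]
print(replay_steps)  # -> [1, 2, 6]
```

In the toy run no replay fires during the three quiet steps, which is the behavior a step-count schedule cannot reproduce: the trigger depends on how far the parameters moved, not on how many batches were seen.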

If this is right

  • Replay decisions become responsive to the model's actual parameter drift rather than external step counts.
  • Regularization strength scales automatically with the predicted forgetting rate at each replay point.
  • The same scheduler works across model scales from 0.6B to 13B parameters.
  • Replays during stable phases, when parameters change little, are skipped, freeing compute for new data.
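
To make the second bullet concrete, here is a hedged sketch of one way a regularization weight could follow a forgetting curve. It assumes the classic exponential retention form R = exp(-t / S) with t measured in model days, and a weight proportional to predicted forgetting, 1 - R. The rule, `beta_base`, and `stability` are illustrative assumptions; the paper's actual intensity-aware term may differ.

```python
import math


def retention(model_days, stability=1.0):
    """Ebbinghaus-style retention R = exp(-t / S), with t in model days."""
    return math.exp(-model_days / stability)


def reg_weight(model_days, beta_base=0.1, stability=1.0):
    """Assumed rule: the weight grows with predicted forgetting, 1 - R."""
    return beta_base * (1.0 - retention(model_days, stability))


# The weight starts at zero and saturates toward beta_base as model time grows.
weights = [reg_weight(d) for d in (0, 1, 7)]
```

Under this assumed rule a replay that fires soon after the last one is regularized lightly, while a replay after a long stretch of parameter drift is regularized close to the full `beta_base`.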

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same update-magnitude clock could be tested on non-LLM architectures where forgetting follows a comparable curve.
  • If the proxy holds, replay frequency could be lowered further in late training without extra forgetting.
  • The approach invites direct measurement of update magnitude versus forgetting on streaming data tasks outside the three benchmarks.

Load-bearing premise

The magnitude of optimizer updates provides a faithful proxy for the model's internal time progression that matches the shape of the Ebbinghaus forgetting curve.

What would settle it

A controlled run in which the correlation between update magnitudes and measured forgetting rates on a held-out task is near zero, or in which fixed-step replay matches FOREVER performance on the same benchmarks.
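
The first of these checks can be sketched in a few lines. The example below computes a Pearson correlation between cumulative update magnitude and the rise in held-out loss on prior tasks. The two lists are made-up placeholder measurements, not results from the paper, and `pearson` is a hypothetical helper written for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Placeholder values, one pair per training stage (hypothetical).
cum_update = [0.5, 1.1, 1.8, 2.9, 4.0]       # cumulative ||delta theta||
loss_rise = [0.02, 0.05, 0.07, 0.12, 0.15]   # rise in held-out loss on prior tasks
r = pearson(cum_update, loss_rise)
```

A value of `r` near zero on real measurements would count against the proxy; a value near one would support it, though correlation alone would not show that the curve shape matches.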

Figures

Figures reproduced from arXiv: 2601.03938 by Hao Wang, Jian Li, Philip S. Yu, Xiao-Ming Wu, Xu Chu, Yasha Wang, Yiran Liu, Yujie Feng, Zhaolu Kang.

Figure 1: Aligning human time and model time in FOREVER. FOREVER aligns Ebbinghaus-inspired human replay intervals with a model-centric timeline defined by accumulated parameter update magnitude, enabling replay to be triggered based on the model's actual learning progress.
Figure 2: Overview of FOREVER. FOREVER decomposes replay into two coupled decisions, when to replay and how to replay, both grounded in model update dynamics. Parameter update magnitudes Δ_t track model evolution over training steps, whose accumulation defines a model-centric notion of time (virtual "model days"). When to replay (Left): accumulated model time τ_t measures how far the model has progressed in parameter sp…
Figure 3: Performance of FOREVER with different backbones on the SuperNI Benchmark.
Figure 5: Visualization of model-centric replay dynamics during training. Left: step-wise parameter update magnitude Δ_t across training steps. Right: accumulated model-centric time τ_t with replay trigger points annotated. Under the proposed model-centric time definition, replay is triggered at different training steps for different tasks, reflecting task-dependent learning dynamics rather than fixed step-based sc…
Figure 6: Visualization of task loss (L_old task) on memory samples and scaled replay regularization loss (β_base · L_reg) at replay stages.
Figure 7: Replay dynamics across datasets and task orders. Each subfigure shows Δ_t, accumulated model time τ_t, and replay trigger points for one task order, illustrating adaptive replay scheduling based on model-centric time.
Original abstract

Continual learning (CL) for large language models (LLMs) aims to enable sequential knowledge acquisition without catastrophic forgetting. Memory replay methods are widely used for their practicality and effectiveness, but most rely on fixed, step-based heuristics that often misalign with the model's actual learning progress, since identical training steps can result in varying degrees of parameter change. Motivated by recent findings that LLM forgetting mirrors the Ebbinghaus human forgetting curve, we propose FOREVER (FORgEtting curVe-inspired mEmory Replay), a novel CL framework that aligns replay schedules with a model-centric notion of time. FOREVER defines model time using the magnitude of optimizer updates, allowing forgetting curve-inspired replay intervals to align with the model's internal evolution rather than raw training steps. Building on this approach, FOREVER incorporates a forgetting curve-based replay scheduler to determine when to replay and an intensity-aware regularization mechanism to adaptively control how to replay. Extensive experiments on three CL benchmarks and models ranging from 0.6B to 13B parameters demonstrate that FOREVER consistently mitigates catastrophic forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FOREVER, a continual learning framework for LLMs that defines a model-centric notion of time via the cumulative magnitude of optimizer updates, uses this to schedule memory replay according to an Ebbinghaus-inspired forgetting curve, and adds an intensity-aware regularization term to control replay strength. It reports that the method consistently reduces catastrophic forgetting relative to baselines across three CL benchmarks and model scales from 0.6B to 13B parameters.

Significance. If the update-magnitude proxy is shown to track actual forgetting dynamics, the work would offer a principled alternative to fixed-step replay heuristics, potentially improving sample efficiency in LLM continual learning. The breadth of evaluation across model sizes is a positive feature, but the central claim rests on an unverified alignment between ||Δθ|| and task-loss rise, which limits immediate impact.

major comments (2)
  1. [§3.2] §3.2 (model-time definition): the claim that cumulative optimizer-update norms provide a faithful proxy for Ebbinghaus-style forgetting requires explicit validation (e.g., a correlation plot or table showing alignment between ||Δθ||_cum and rise in loss on held-out prior tasks); without it, performance gains cannot be confidently attributed to curve alignment rather than replay frequency or the added regularization term alone.
  2. [§4] §4 (experimental results): the abstract and results section report consistent gains on three benchmarks, yet no error bars, statistical significance tests, or full ablations isolating the scheduler from the intensity regularizer are described; this makes it impossible to verify robustness across the 0.6B–13B range or rule out that simpler step-based replay with the same regularization would suffice.
minor comments (2)
  1. [Abstract] Abstract: the reference to “recent findings” on LLM forgetting mirroring the Ebbinghaus curve should include a specific citation rather than remaining generic.
  2. [§3.1] Notation: the symbol for cumulative update magnitude is introduced without an explicit equation label, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve the validation of our core claims and the rigor of the experimental reporting.

Point-by-point responses
  1. Referee: [§3.2] the claim that cumulative optimizer-update norms provide a faithful proxy for Ebbinghaus-style forgetting requires explicit validation (e.g., a correlation plot or table showing alignment between ||Δθ||_cum and rise in loss on held-out prior tasks); without it, performance gains cannot be confidently attributed to curve alignment rather than replay frequency or the added regularization term alone.

    Authors: We agree that explicit validation of the update-magnitude proxy is necessary to strengthen attribution of gains to the forgetting-curve alignment. Although the model-centric time definition is motivated by prior observations of Ebbinghaus-like forgetting in LLMs, we will add a dedicated analysis (new figure and table) in the revised §3.2. This will report Pearson correlations and a scatter plot between cumulative ||Δθ|| and the rise in held-out loss on prior tasks across training stages, directly addressing the concern. revision: yes

  2. Referee: [§4] the abstract and results section report consistent gains on three benchmarks, yet no error bars, statistical significance tests, or full ablations isolating the scheduler from the intensity regularizer are described; this makes it impossible to verify robustness across the 0.6B–13B range or rule out that simpler step-based replay with the same regularization would suffice.

    Authors: We acknowledge the current lack of error bars, significance testing, and isolating ablations. In the revision we will (i) report mean ± standard deviation across at least three random seeds for all metrics, (ii) add paired t-test p-values for the main comparisons, and (iii) expand §4 with a full ablation that includes a step-based replay baseline equipped with the identical intensity regularizer. These additions will allow readers to assess robustness across model scales and isolate the contribution of the scheduler. revision: yes
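
The reporting promised in response 2 is simple to compute. The sketch below derives mean ± standard deviation across seeds and a paired t statistic for two methods evaluated on the same seeds. The scores are hypothetical placeholders, not numbers from the paper, and `paired_t` is an illustrative helper. Turning the statistic into a p-value requires the t distribution, for which a library routine such as `scipy.stats.ttest_rel` is commonly used.

```python
import math
from statistics import mean, stdev


def mean_std(scores):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return mean(scores), stdev(scores)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two matched samples of length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean(diffs) / (sd / math.sqrt(len(diffs))), len(diffs) - 1


# Hypothetical per-seed scores for three seeds (placeholders only).
forever_scores = [71.2, 70.8, 71.5]
baseline_scores = [69.9, 70.1, 70.0]

m, s = mean_std(forever_scores)
t_stat, dof = paired_t(forever_scores, baseline_scores)
```

With only three seeds there are two degrees of freedom, so even a large t statistic gives weak evidence; that is a reason to treat "at least three random seeds" as a floor rather than a target.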

Circularity Check

0 steps flagged

No significant circularity; proxy and external motivation remain independent of target metrics

Full rationale

The derivation defines model-centric time directly as the cumulative magnitude of optimizer updates (a measurable input quantity) and applies replay intervals drawn from an Ebbinghaus functional form whose motivation is attributed to external prior findings on LLM forgetting. No equation fits a parameter to the evaluation forgetting metric and then re-uses that fit as the 'prediction' or scheduler; the intensity regularization adapts replay strength from the same proxy without tautological reduction. No self-citation is shown to be load-bearing for the central claim, and the chain does not rename or smuggle an ansatz that collapses to the input data. The reported gains therefore rest on an independent proxy rather than on constructional equivalence to the benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that LLM forgetting follows the Ebbinghaus curve shape and that update magnitude serves as a suitable time axis; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption LLM forgetting mirrors the Ebbinghaus human forgetting curve
    Stated as motivation from recent findings; used to justify the replay scheduler shape.
  • domain assumption Magnitude of optimizer updates defines a faithful model-centric time
    Central to replacing step-based heuristics with update-magnitude intervals.




Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

    cs.AI 2026-04 unverdicted novelty 7.0

    SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.

  2. Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.

  3. Not All Memories Age the Same: Autodiscovery of Adaptive Decay in Knowledge Graphs

    cs.IR 2026-04 unverdicted novelty 6.0

    Knowledge graphs should use data-driven hierarchical decay surfaces based on velocity and volatility instead of uniform forgetting curves to better identify currently relevant facts.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 3 Pith papers · 1 internal anchor

  1. [1]

    In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17167–17186

    Mitigating catastrophic forgetting in language transfer via model merging. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17167–17186. Andrew Bai, Chih-Kuan Yeh, Cho-Jui Hsieh, and Ankur Taly. 2025. An efficient rehearsal scheme for catastrophic forgetting mitigation during multi-stage fine-tuning. Preprint, arXiv:2402....

  2. [2]

    arXiv preprint arXiv:2503.01595 (2025)

    LoRAMoE: Alleviating world knowledge forgetting in large language models via MoE-style plugin. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1932–1945, Bangkok, Thailand. Association for Computational Linguistics. Wenyu Du, Shuang Cheng, Tongxu Luo, Zihan Qiu, Zeyu Huang, Ka ...

  3. [3]

    Refine large language model fine-tuning via instruction vector. arXiv preprint arXiv:2406.12227, 2024

    Interpretable catastrophic forgetting of large language model fine-tuning via instruction vector. arXiv preprint arXiv:2406.12227. Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, and Ying Wei

  4. [4]

    Hankyul Kang, Gregor Seifer, Donghyun Lee, and Jongbin Ryu

    Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning. arXiv preprint arXiv:2502.11019. Hankyul Kang, Gregor Seifer, Donghyun Lee, and Jongbin Ryu. 2025a. Do your best and get enough rest for continual learning. In Proceedings of the Computer Vision and Pattern Recognition Confe...

  5. [5]

    Zixuan Ke, Bing Liu, Wenhan Xiong, Asli Celikyilmaz, and Haoran Li

    Achieving forgetting prevention and knowledge transfer in continual learning. Advances in Neural Information Processing Systems, 34:22443–22456. Zixuan Ke, Bing Liu, Wenhan Xiong, Asli Celikyilmaz, and Haoran Li. 2023. Sub-network discovery and soft-masking for continual learning of mixed tasks. arXiv preprint arXiv:2310.09436. James Kirkpatrick, Razvan...

  6. [6]

    In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18489–18504

    Dynamic expert specialization: Towards catastrophic forgetting-free multi-domain moe adaptation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18489–18504. Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, and 1 others

  7. [7]

    Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

    Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990. Aojun Lu, Hangjie Yuan, Tao Feng, and Yanan Sun

  8. [8]

    arXiv preprint arXiv:2506.03951

    Rethinking the stability-plasticity trade-off in continual learning from an architectural perspective. Preprint, arXiv:2506.03951. Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. 2024. Moelora: Contrastive learning guided mixture of experts on parameter-efficient fine-tuning for large language models. arXiv preprint arXiv:...

  9. [9]

    Razdaibiedina, Y

    Progressive prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314. Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. 2024. Analyzing and reducing catastrophic forgetting in parameter efficient tuning. arXiv preprint arXiv:2402.18865. Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, ...

  10. [10]

    arXiv preprint arXiv:2403.18886 (2024) 26

    Self-expansion of pre-trained models with mixture of adapters for continual learning. arXiv preprint arXiv:2403.18886. Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. 2023. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistic...

    Table 7: The details of 15 datasets in the SuperNI Benchmark (Wang et al., 2022).

    Dataset name | Task | Metric
    task639_multi_woz_user_utterance_generation | dialogue generation | Rouge-L
    task1590_diplomacy_text_generation | dialogue generation | Rouge-L
    task1729_personachat_generate_next | dialogue generation | Rouge-L
    task181_outcome_extraction | information extraction | Rouge-L
    task748_glucose_reverse_cause_event_detection | information extraction | Rouge-L
    task1510_evalution_relation_extraction | information extraction | Rouge-L
    task002_quoref_answer_generation | question answering | Rouge-L
    task073_commonsenseqa_answer_generation | question answering | Rouge-L
    task591_sciq_answer_generation | question answering | Rouge-L
    task511_reddit_tifu_long_text_summarization | summarization | Rouge-L
    task1290_xsum_summarization | summarization | Rouge-L
    task1572_samsum_summary | summarization | Rouge-L
    task363_sst2_polarity_classification | sentiment analysis | accuracy
    task875_emotion_classification | sentiment analysis | accuracy
    task1687_sentiment140_classification | sentiment analysis | accuracy

    Table 8: The details of 15 classification datasets in the Long Sequence Benchmark (Razdai et al., 2022). First five tasks correspond to the standard CL benchmark (Zhang et al., 2015).

    Dataset name | Category | Task | Domain | Metric
    Yelp | CL Benchmark | sentiment analysis | Yelp reviews | accuracy
    Amazon | CL Benchmark | sentiment analysis | Amazon reviews | accuracy
    DBpedia | CL Benchmark | topic classification | Wikipedia | accuracy
    Yahoo | CL Benchmark | topic classification | Yahoo Q&A | accuracy
    AG News | CL Benchmark | topic classification | news | accuracy
    MNLI | GLUE | natural language inference | various | accuracy
    QQP | GLUE | paraphrase detection | Quora | accuracy
    RTE | GLUE | natural language inference | news, Wikipedia | accuracy
    SST-2 | GLUE | sentiment analysis | movie reviews | accuracy
    WiC | SuperGLUE | word sense disambiguation | lexical databases | accuracy
    CB | SuperGLUE | natural language inference | various | accuracy
    COPA | SuperGLUE | question and answering | blogs, encyclopedia | accuracy
    BoolQA | SuperGLUE | boolean question and answering | Wikipedia | accuracy
    MultiRC | SuperGLUE | question and answering | various | accuracy
    IMDB | SuperGLUE | sentiment analysis | movie reviews | accuracy

    Order | Benchmark | Task Sequence
    1 | Standard CL | dbpedia→amazon→yahoo→ag
    2 | Standard CL | dbpedia→amazon→ag→yahoo
    3 | Standard CL | yahoo→amazon→ag→dbpe...