FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning
Pith reviewed 2026-05-16 16:26 UTC · model grok-4.3
The pith
FOREVER schedules memory replays for LLMs using a forgetting curve timed by optimizer update magnitude instead of training steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FOREVER defines model time from the magnitude of optimizer updates, applies a forgetting-curve scheduler to decide when to replay and how intensely, and adds intensity-aware regularization; the result is consistent mitigation of catastrophic forgetting on three continual-learning benchmarks across models ranging from 0.6B to 13B parameters.
What carries the argument
The forgetting curve-based replay scheduler that treats optimizer update magnitude as model time to set replay intervals and regularization strength.
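The model-time idea can be made concrete with a small sketch. It follows the formula quoted further down this page (model time as the running sum of per-step L2 update norms). The flat parameter lists, the function names, and the toy trajectory are illustrative assumptions, not the authors' implementation.

```python
import math

def update_magnitude(prev_params, new_params):
    """L2 norm of the parameter change produced by one optimizer step."""
    return math.sqrt(sum((n - p) ** 2 for p, n in zip(prev_params, new_params)))

def model_time(checkpoints):
    """Cumulative update magnitude after each step:
    tau_t = sum over i <= t of ||theta_i - theta_{i-1}||_2."""
    tau, elapsed = 0.0, []
    for prev, new in zip(checkpoints, checkpoints[1:]):
        tau += update_magnitude(prev, new)
        elapsed.append(tau)
    return elapsed

# Toy trajectory (assumed values): one large step followed by two small ones.
trajectory = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.1], [3.0, 4.2]]
taus = model_time(trajectory)
```

In this toy run the first step advances model time far more than the next two combined, even though each counts as exactly one training step. That mismatch is what the paper's argument turns on.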
If this is right
- Replay decisions become responsive to the model's actual parameter drift rather than external step counts.
- Regularization strength scales automatically with the predicted forgetting rate at each replay point.
- The same scheduler works across the evaluated model scales, from 0.6B to 13B parameters.
- Unnecessary replays during stable phases, when updates are small, are reduced, freeing compute for new data.
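One way to picture the second consequence is a regularization weight that grows with predicted forgetting. The sketch below is hypothetical: the exponential retention form, the `stability` constant, and the rule `max_weight * (1 - retention)` are assumptions chosen for illustration, and the paper's actual intensity-aware regularizer may differ.

```python
import math

def retention(tau_since_replay, stability):
    """Ebbinghaus-style retention exp(-tau / S): 1.0 right after a replay, decaying toward 0."""
    return math.exp(-tau_since_replay / stability)

def regularization_weight(tau_since_replay, stability, max_weight=1.0):
    """Hypothetical rule: weight scales with predicted forgetting, 1 - retention."""
    return max_weight * (1.0 - retention(tau_since_replay, stability))

# Weights at increasing model time since the last replay (assumed values).
weights = [regularization_weight(t, stability=2.0) for t in (0.0, 1.0, 4.0)]
```

Under these assumptions the weight is zero immediately after a replay and approaches `max_weight` as model time accumulates.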
Where Pith is reading between the lines
- The same update-magnitude clock could be tested on non-LLM architectures where forgetting follows a comparable curve.
- If the proxy holds, replay frequency could be lowered further in late training without extra forgetting.
- The approach invites direct measurement of update magnitude versus forgetting on streaming data tasks outside the three benchmarks.
Load-bearing premise
The magnitude of optimizer updates provides a faithful proxy for the model's internal time progression that matches the shape of the Ebbinghaus forgetting curve.
What would settle it
A controlled run in which the correlation between update magnitudes and measured forgetting rates on a held-out task is near zero, or in which fixed-step replay matches FOREVER performance on the same benchmarks.
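The first half of that test reduces to a correlation between two measured series. The sketch below computes a Pearson coefficient in plain Python. The two toy series are made-up placeholders, not measurements from the paper, and the variable names are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Placeholder series: cumulative update magnitude and rise in held-out loss.
cumulative_update = [1.0, 2.0, 3.0, 4.0, 5.0]
loss_rise_tracking = [0.1, 0.2, 0.3, 0.4, 0.5]    # rises in lockstep with the proxy
loss_rise_unrelated = [0.3, 0.1, 0.3, 0.1, 0.3]   # no linear relation to the proxy

r_tracking = pearson(cumulative_update, loss_rise_tracking)
r_unrelated = pearson(cumulative_update, loss_rise_unrelated)
```

A coefficient near zero on real held-out measurements, as in the second placeholder series, would be the outcome that undermines the proxy.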
Original abstract
Continual learning (CL) for large language models (LLMs) aims to enable sequential knowledge acquisition without catastrophic forgetting. Memory replay methods are widely used for their practicality and effectiveness, but most rely on fixed, step-based heuristics that often misalign with the model's actual learning progress, since identical training steps can result in varying degrees of parameter change. Motivated by recent findings that LLM forgetting mirrors the Ebbinghaus human forgetting curve, we propose FOREVER (FORgEtting curVe-inspired mEmory Replay), a novel CL framework that aligns replay schedules with a model-centric notion of time. FOREVER defines model time using the magnitude of optimizer updates, allowing forgetting curve-inspired replay intervals to align with the model's internal evolution rather than raw training steps. Building on this approach, FOREVER incorporates a forgetting curve-based replay scheduler to determine when to replay and an intensity-aware regularization mechanism to adaptively control how to replay. Extensive experiments on three CL benchmarks and models ranging from 0.6B to 13B parameters demonstrate that FOREVER consistently mitigates catastrophic forgetting.
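A minimal sketch of a scheduler that fires on model time rather than step count is given below. It assumes replay thresholds of the form d · τ_day, as in the passage quoted later on this page. The multiplier set `(1, 2, 4, 7, 15)`, the class name, and the calibration constant `tau_day` are hypothetical choices, not values taken from the paper.

```python
class ModelTimeReplayScheduler:
    """Triggers a replay whenever cumulative update magnitude crosses the next threshold."""

    def __init__(self, tau_day, multipliers=(1, 2, 4, 7, 15)):
        # Thresholds in model-time units; multipliers are assumed, not from the paper.
        self.thresholds = [d * tau_day for d in multipliers]
        self.tau = 0.0
        self.next_idx = 0

    def step(self, update_norm):
        """Advance model time by one optimizer step; return True if a replay is due."""
        self.tau += update_norm
        due = False
        # A single large update may cross several thresholds; they collapse into one replay.
        while self.next_idx < len(self.thresholds) and self.tau >= self.thresholds[self.next_idx]:
            self.next_idx += 1
            due = True
        return due

# Toy per-step update norms (assumed values).
scheduler = ModelTimeReplayScheduler(tau_day=1.0)
update_norms = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 2.0]
replay_steps = [i for i, norm in enumerate(update_norms) if scheduler.step(norm)]
```

With these toy norms the two small updates at steps 4 and 5 barely advance model time, so no replay fires there, while the large update at step 6 triggers one immediately. A fixed-step schedule would not distinguish these cases.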
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FOREVER, a continual learning framework for LLMs that defines a model-centric notion of time via the cumulative magnitude of optimizer updates, uses this to schedule memory replay according to an Ebbinghaus-inspired forgetting curve, and adds an intensity-aware regularization term to control replay strength. It reports that the method consistently reduces catastrophic forgetting relative to baselines across three CL benchmarks and model scales from 0.6B to 13B parameters.
Significance. If the update-magnitude proxy is shown to track actual forgetting dynamics, the work would offer a principled alternative to fixed-step replay heuristics, potentially improving sample efficiency in LLM continual learning. The breadth of evaluation across model sizes is a positive feature, but the central claim rests on an unverified alignment between ||Δθ|| and task-loss rise, which limits immediate impact.
major comments (2)
- [§3.2] Model-time definition: the claim that cumulative optimizer-update norms provide a faithful proxy for Ebbinghaus-style forgetting requires explicit validation (e.g., a correlation plot or table showing alignment between ||Δθ||_cum and rise in loss on held-out prior tasks); without it, performance gains cannot be confidently attributed to curve alignment rather than replay frequency or the added regularization term alone.
- [§4] Experimental results: the abstract and results section report consistent gains on three benchmarks, yet no error bars, statistical significance tests, or full ablations isolating the scheduler from the intensity regularizer are described; this makes it impossible to verify robustness across the 0.6B–13B range or rule out that simpler step-based replay with the same regularization would suffice.
minor comments (2)
- [Abstract] Abstract: the reference to “recent findings” on LLM forgetting mirroring the Ebbinghaus curve should include a specific citation rather than remaining generic.
- [§3.1] Notation: the symbol for cumulative update magnitude is introduced without an explicit equation label, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve the validation of our core claims and the rigor of the experimental reporting.
Point-by-point responses
- Referee: [§3.2] the claim that cumulative optimizer-update norms provide a faithful proxy for Ebbinghaus-style forgetting requires explicit validation (e.g., a correlation plot or table showing alignment between ||Δθ||_cum and rise in loss on held-out prior tasks); without it, performance gains cannot be confidently attributed to curve alignment rather than replay frequency or the added regularization term alone.
  Authors: We agree that explicit validation of the update-magnitude proxy is necessary to strengthen attribution of gains to the forgetting-curve alignment. Although the model-centric time definition is motivated by prior observations of Ebbinghaus-like forgetting in LLMs, we will add a dedicated analysis (new figure and table) in the revised §3.2. This will report Pearson correlations and a scatter plot between cumulative ||Δθ|| and the rise in held-out loss on prior tasks across training stages, directly addressing the concern.
  Revision: yes
- Referee: [§4] the abstract and results section report consistent gains on three benchmarks, yet no error bars, statistical significance tests, or full ablations isolating the scheduler from the intensity regularizer are described; this makes it impossible to verify robustness across the 0.6B–13B range or rule out that simpler step-based replay with the same regularization would suffice.
  Authors: We acknowledge the current lack of error bars, significance testing, and isolating ablations. In the revision we will (i) report mean ± standard deviation across at least three random seeds for all metrics, (ii) add paired t-test p-values for the main comparisons, and (iii) expand §4 with a full ablation that includes a step-based replay baseline equipped with the identical intensity regularizer. These additions will allow readers to assess robustness across model scales and isolate the contribution of the scheduler.
  Revision: yes
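The reporting promised in the second response (mean ± standard deviation over seeds, plus a paired test) can be sketched with the standard library alone. The per-seed scores below are placeholders invented for illustration, not results from the paper. The sketch stops at the t statistic; a p-value would additionally require the t distribution with n − 1 degrees of freedom.

```python
import math
import statistics

def mean_std(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Placeholder per-seed scores for two methods evaluated on the same three seeds.
scheduler_scores = [0.70, 0.72, 0.74]
step_based_scores = [0.69, 0.70, 0.71]

mean_a, std_a = mean_std(scheduler_scores)
t_stat = paired_t_statistic(scheduler_scores, step_based_scores)
```

Pairing by seed matters here: with only three seeds, an unpaired comparison would fold seed-to-seed variance into the error term and could hide a consistent per-seed gap.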
Circularity Check
No significant circularity; proxy and external motivation remain independent of target metrics
full rationale
The derivation defines model-centric time directly as the cumulative magnitude of optimizer updates (a measurable input quantity) and applies replay intervals drawn from an Ebbinghaus functional form whose motivation is attributed to external prior findings on LLM forgetting. No equation fits a parameter to the evaluation forgetting metric and then re-uses that fit as the 'prediction' or scheduler; the intensity regularization adapts replay strength from the same proxy without tautological reduction. No self-citation is shown to be load-bearing for the central claim, and the chain does not rename or smuggle an ansatz that collapses to the input data. The reported gains therefore rest on an independent proxy rather than on constructional equivalence to the benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM forgetting mirrors the Ebbinghaus human forgetting curve
- domain assumption: Magnitude of optimizer updates defines a faithful model-centric time
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: FOREVER defines model time using the magnitude of optimizer updates... τ_t = Σ Δ_i where Δ_t = ||Θ_t − Θ_{t−1}||_2
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Ebbinghaus-guided replay schedule on model time... D_model = {d · τ_day}
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems
  SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.
- Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
  Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.
- Not All Memories Age the Same: Autodiscovery of Adaptive Decay in Knowledge Graphs
  Knowledge graphs should use data-driven hierarchical decay surfaces based on velocity and volatility instead of uniform forgetting curves to better identify currently relevant facts.
[11]
task639_multi_woz_user_utterance_generation dialogue generation Rouge-L
-
[12]
task1590_diplomacy_text_generation dialogue generation Rouge-L
-
[13]
task1729_personachat_generate_next dialogue generation Rouge-L
-
[14]
task181_outcome_extraction information extraction Rouge-L
-
[15]
task748_glucose_reverse_cause_event_detection information extraction Rouge-L
-
[16]
task1510_evalution_relation_extraction information extraction Rouge-L
-
[17]
task002_quoref_answer_generation question answering Rouge-L
-
[18]
task073_commonsenseqa_answer_generation question answering Rouge-L
-
[19]
task591_sciq_answer_generation question answering Rouge-L
-
[20]
task511_reddit_tifu_long_text_summarization summarization Rouge-L
-
[21]
task1290_xsum_summarization summarization Rouge-L
-
[22]
task1572_samsum_summary summarization Rouge-L
-
[23]
task363_sst2_polarity_classification sentiment analysis accuracy
-
[24]
task875_emotion_classification sentiment analysis accuracy
-
[25]
Dataset name Category Task Domain Metric
task1687_sentiment140_classification sentiment analysis accuracy Table 7: The details of 15 datasets in the SuperNI Benchmark (Wang et al., 2022). Dataset name Category Task Domain Metric
work page 2022
-
[26]
Yelp CL Benchmark sentiment analysis Yelp reviews accuracy
-
[27]
Amazon CL Benchmark sentiment analysis Amazon reviews accuracy
-
[28]
DBpedia CL Benchmark topic classification Wikipedia accuracy
-
[29]
Yahoo CL Benchmark topic classification Yahoo Q&A accuracy
-
[30]
AG News CL Benchmark topic classification news accuracy
-
[31]
MNLI GLUE natural language inference various accuracy
-
[32]
QQP GLUE paragraph detection Quora accuracy
-
[33]
RTE GLUE natural language inference news, Wikipedia accuracy
-
[34]
SST-2 GLUE sentiment analysis movie reviews accuracy
-
[35]
WiC SuperGLUE word sense disambiguation lexical databases accuracy
-
[36]
CB SuperGLUE natural language inference various accuracy
-
[37]
COPA SuperGLUE question and answering blogs, encyclopedia accuracy
-
[38]
BoolQA SuperGLUE boolean question and answering Wikipedia accuracy
-
[39]
MultiRC SuperGLUE question and answering various accuracy
-
[40]
First five tasks correspond to the standard CL benchmark (Zhang et al., 2015)
IMDB SuperGLUE sentiment analysis movie reviews accuracy Table 8: The details of 15 classification datasets in the Long Sequence Benchmark (Razdai et al., 2022). First five tasks correspond to the standard CL benchmark (Zhang et al., 2015). Order Benchmark Task Sequence 1 Standard CL dbpedia→amazon→yahoo→ag 2 dbpedia→amazon→ag→yahoo 3 yahoo→amazon→ag→dbpe...
work page 2022