FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning

Hao Wang; Jian Li; Philip S. Yu; Xiao-Ming Wu; Xu Chu; Yasha Wang; Yiran Liu; Yujie Feng; Zhaolu Kang

arXiv: 2601.03938 · v2 · submitted 2026-01-07 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-16 16:26 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL

keywords: continual learning · catastrophic forgetting · memory replay · forgetting curve · large language models · optimizer updates · Ebbinghaus forgetting

The pith

FOREVER schedules memory replays for LLMs using a forgetting curve timed by optimizer update magnitude instead of training steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a continual learning approach for large language models that replays selected past examples at intervals drawn from the Ebbinghaus forgetting curve. It replaces fixed step counts with a model-centric clock based on the size of parameter updates during optimization. This change lets replay timing and intensity track how much the model has actually changed rather than how many batches have passed. Experiments on three benchmarks with models from 0.6B to 13B parameters show reduced forgetting while new tasks are learned. A reader would care because it offers a direct way to keep earlier knowledge alive when an LLM must absorb new data in sequence.

Core claim

FOREVER defines model time from the magnitude of optimizer updates, applies a forgetting-curve scheduler to decide when to replay and how intensely, and adds intensity-aware regularization; the result is consistent mitigation of catastrophic forgetting on three continual-learning benchmarks across models ranging from 0.6B to 13B parameters.

What carries the argument

The forgetting curve-based replay scheduler that treats optimizer update magnitude as model time to set replay intervals and regularization strength.
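
To make the mechanism concrete, here is a minimal sketch of a model-time clock. It follows the definitions quoted further down this page (Δ_t = ||Θ_t − Θ_{t−1}||_2, τ_t = Σ Δ_i, replay points at d · τ_day). The class name `ModelClock`, the flat-list parameter representation, the value of `tau_day`, and the list of replay days are all assumptions made for illustration; the page does not state how the paper calibrates τ_day or which days it uses.

```python
import math


class ModelClock:
    """Accumulates model time tau_t as the running sum of L2 update norms.

    Hypothetical helper: tau_day and replay_days are illustrative inputs,
    not values taken from the paper.
    """

    def __init__(self, tau_day, replay_days):
        if tau_day <= 0:
            raise ValueError("tau_day must be positive")
        self.tau_day = tau_day
        self.pending = sorted(replay_days)  # replay points, in model days
        self.tau = 0.0                      # accumulated model time

    def step(self, prev_params, new_params):
        """Advance the clock by one optimizer update.

        Returns the replay days whose threshold d * tau_day has been reached.
        """
        # Delta_t = ||theta_t - theta_{t-1}||_2 over a flat parameter list
        delta = math.sqrt(
            sum((n - p) ** 2 for p, n in zip(prev_params, new_params))
        )
        self.tau += delta
        due = [d for d in self.pending if d * self.tau_day <= self.tau]
        self.pending = [d for d in self.pending if d * self.tau_day > self.tau]
        return due
```

With `ModelClock(tau_day=5.0, replay_days=[1, 2, 4, 7])`, an update from `[0.0, 0.0]` to `[3.0, 4.0]` has norm 5.0 and makes day 1 due, while an update that leaves the parameters unchanged advances nothing. That is the property the page describes: the clock moves with parameter drift, not with the number of batches.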

If this is right

Replay decisions become responsive to the model's actual parameter drift rather than external step counts.
Regularization strength scales automatically with the predicted forgetting rate at each replay point.
The same scheduler works across the evaluated model scales, from 0.6B to 13B parameters.
Unnecessary replays at early stages of stability are reduced, freeing compute for new data.
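
One way to picture the second consequence is a regularization weight that grows with predicted forgetting. The sketch below uses the common exponential form of the Ebbinghaus curve, R = exp(−t / S). The mapping from retention to weight (`beta_base * (1 - R)`), the `stability` parameter, and both function names are hypothetical; the page mentions a β_base term but does not give the paper's actual rule.

```python
import math


def retention(elapsed_days, stability=1.0):
    """Ebbinghaus-style exponential retention R = exp(-t / S)."""
    if stability <= 0:
        raise ValueError("stability must be positive")
    return math.exp(-elapsed_days / stability)


def replay_intensity(elapsed_days, beta_base=1.0, stability=1.0):
    """Hypothetical rule: weight = beta_base * predicted forgetting (1 - R).

    This is an illustrative assumption, not the paper's formula.
    """
    return beta_base * (1.0 - retention(elapsed_days, stability))
```

Under this assumed rule the weight is zero when no model time has passed and approaches `beta_base` as the predicted retention decays, so later replay points are regularized more strongly than early ones.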

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same update-magnitude clock could be tested on non-LLM architectures where forgetting follows a comparable curve.
If the proxy holds, replay frequency could be lowered further in late training without extra forgetting.
The approach invites direct measurement of update magnitude versus forgetting on streaming data tasks outside the three benchmarks.

Load-bearing premise

The magnitude of optimizer updates provides a faithful proxy for the model's internal time progression that matches the shape of the Ebbinghaus forgetting curve.

What would settle it

A controlled run in which the correlation between update magnitudes and measured forgetting rates on a held-out task is near zero, or in which fixed-step replay matches FOREVER performance on the same benchmarks.
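
The first test above reduces to a correlation between two logged series. A minimal sketch of that computation follows, assuming one has already recorded a list of accumulated update magnitudes and a matching list of measured loss increases on a held-out prior task. Both inputs and the function name `pearson` are placeholders; no such measurements appear on this page.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)
```

A coefficient near zero on real logs would undercut the proxy; a value near one would support it, although correlation alone would not show that the Ebbinghaus shape is the right functional form.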

Figures

Figures reproduced from arXiv: 2601.03938 by Hao Wang, Jian Li, Philip S. Yu, Xiao-Ming Wu, Xu Chu, Yasha Wang, Yiran Liu, Yujie Feng, Zhaolu Kang.

**Figure 1.** Aligning human time and model time in FOREVER. FOREVER aligns Ebbinghaus-inspired human replay intervals with a model-centric timeline defined by accumulated parameter update magnitude, enabling replay to be triggered based on the model's actual learning progress.

**Figure 2.** Overview of FOREVER. FOREVER decomposes replay into two coupled decisions (when to replay and how to replay), both grounded in model update dynamics. Parameter update magnitudes Δ_t track model evolution over training steps, whose accumulation defines a model-centric notion of time (virtual "model days"). When to replay (Left): accumulated model time τ_t measures how far the model has progressed in parameter sp…

**Figure 3.** Performance of FOREVER with different backbones on the SuperNI Benchmark.

**Figure 5.** Visualization of model-centric replay dynamics during training. Left: step-wise parameter update magnitude Δ_t across training steps. Right: accumulated model-centric time τ_t with replay trigger points annotated. Under the proposed model-centric time definition, replay is triggered at different training steps for different tasks, reflecting task-dependent learning dynamics rather than fixed step-based sc…

**Figure 6.** Visualization of task loss (L_{old task}) on memory samples and scaled replay regularization loss (β_base · L_reg) at replay stages.

**Figure 7.** Replay dynamics across datasets and task orders. Each subfigure shows Δ_t, accumulated model time τ_t, and replay trigger points for one task order, illustrating adaptive replay scheduling based on model-centric time.

Original abstract

Continual learning (CL) for large language models (LLMs) aims to enable sequential knowledge acquisition without catastrophic forgetting. Memory replay methods are widely used for their practicality and effectiveness, but most rely on fixed, step-based heuristics that often misalign with the model's actual learning progress, since identical training steps can result in varying degrees of parameter change. Motivated by recent findings that LLM forgetting mirrors the Ebbinghaus human forgetting curve, we propose FOREVER (FORgEtting curVe-inspired mEmory Replay), a novel CL framework that aligns replay schedules with a model-centric notion of time. FOREVER defines model time using the magnitude of optimizer updates, allowing forgetting curve-inspired replay intervals to align with the model's internal evolution rather than raw training steps. Building on this approach, FOREVER incorporates a forgetting curve-based replay scheduler to determine when to replay and an intensity-aware regularization mechanism to adaptively control how to replay. Extensive experiments on three CL benchmarks and models ranging from 0.6B to 13B parameters demonstrate that FOREVER consistently mitigates catastrophic forgetting.
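
The abstract's "model time" can be written out using the formulas quoted in the Lean-theorem section of this page. The block below restates them in LaTeX and adds the commonly used exponential form of the Ebbinghaus curve for reference. The index set for d and the stability constant S are not specified on this page and are left symbolic.

```latex
% Per-step update magnitude and accumulated model time (as quoted on this page)
\Delta_t = \lVert \Theta_t - \Theta_{t-1} \rVert_2,
\qquad
\tau_t = \sum_{i=1}^{t} \Delta_i

% Replay points on the model-time axis; d ranges over the chosen replay days
D_{\text{model}} = \{\, d \cdot \tau_{\text{day}} \,\}

% Common exponential form of the Ebbinghaus retention curve (S: stability, assumed)
R(t) = e^{-t / S}
```

As a small worked example with made-up numbers: if three consecutive updates have magnitudes 0.5, 0.3, and 0.2, then τ_3 = 1.0, so a replay point at d · τ_day = 1.0 is reached on the third step regardless of how many steps a different run would need to cover the same distance.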

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FOREVER ties replay to optimizer update magnitude as model time and an Ebbinghaus scheduler, with reported gains across benchmarks, but the edge depends on that proxy actually tracking forgetting better than steps.

The main new piece is measuring model time by cumulative optimizer update norms instead of raw steps, then using that to set replay intervals from an Ebbinghaus forgetting curve plus an intensity regularization term. The abstract says this beats fixed-step replay on three CL benchmarks for models from 0.6B to 13B, which is a reasonable scope for the claim. The experiments at least cover scale, which many continual learning papers skip, and the mechanism looks internally consistent without obvious circularity in the math. If the full results include solid ablations that separate the scheduler from the extra regularization, the gains could be real and useful for production LLM adaptation pipelines.

The soft spot is the proxy itself. The paper motivates from recent findings on LLM forgetting curves, but it is not clear from the abstract whether they checked that update magnitude correlates with actual prior-task loss rise on these specific models and tasks. If the correlation is weak or task-dependent, the improvements might come from replay frequency or regularization alone rather than curve alignment, which would limit how far the idea travels. The stress-test note flags exactly this, and without seeing the verification details it stays a plausible but unconfirmed assumption.

This is the sort of targeted engineering paper that deserves referee time to check the controls and generalization. I'd send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FOREVER, a continual learning framework for LLMs that defines a model-centric notion of time via the cumulative magnitude of optimizer updates, uses this to schedule memory replay according to an Ebbinghaus-inspired forgetting curve, and adds an intensity-aware regularization term to control replay strength. It reports that the method consistently reduces catastrophic forgetting relative to baselines across three CL benchmarks and model scales from 0.6B to 13B parameters.

Significance. If the update-magnitude proxy is shown to track actual forgetting dynamics, the work would offer a principled alternative to fixed-step replay heuristics, potentially improving sample efficiency in LLM continual learning. The breadth of evaluation across model sizes is a positive feature, but the central claim rests on an unverified alignment between ||Δθ|| and task-loss rise, which limits immediate impact.

major comments (2)

[§3.2] Model-time definition: the claim that cumulative optimizer-update norms provide a faithful proxy for Ebbinghaus-style forgetting requires explicit validation (e.g., a correlation plot or table showing alignment between ||Δθ||_cum and rise in loss on held-out prior tasks); without it, performance gains cannot be confidently attributed to curve alignment rather than replay frequency or the added regularization term alone.
[§4] Experimental results: the abstract and results section report consistent gains on three benchmarks, yet no error bars, statistical significance tests, or full ablations isolating the scheduler from the intensity regularizer are described; this makes it impossible to verify robustness across the 0.6B–13B range or rule out that simpler step-based replay with the same regularization would suffice.

minor comments (2)

[Abstract] The reference to "recent findings" on LLM forgetting mirroring the Ebbinghaus curve should include a specific citation rather than remaining generic.
[§3.1] Notation: the symbol for cumulative update magnitude is introduced without an explicit equation label, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve the validation of our core claims and the rigor of the experimental reporting.

Referee: [§3.2] the claim that cumulative optimizer-update norms provide a faithful proxy for Ebbinghaus-style forgetting requires explicit validation (e.g., a correlation plot or table showing alignment between ||Δθ||_cum and rise in loss on held-out prior tasks); without it, performance gains cannot be confidently attributed to curve alignment rather than replay frequency or the added regularization term alone.

Authors: We agree that explicit validation of the update-magnitude proxy is necessary to strengthen attribution of gains to the forgetting-curve alignment. Although the model-centric time definition is motivated by prior observations of Ebbinghaus-like forgetting in LLMs, we will add a dedicated analysis (new figure and table) in the revised §3.2. This will report Pearson correlations and a scatter plot between cumulative ||Δθ|| and the rise in held-out loss on prior tasks across training stages, directly addressing the concern. Revision: yes.

Referee: [§4] the abstract and results section report consistent gains on three benchmarks, yet no error bars, statistical significance tests, or full ablations isolating the scheduler from the intensity regularizer are described; this makes it impossible to verify robustness across the 0.6B–13B range or rule out that simpler step-based replay with the same regularization would suffice.

Authors: We acknowledge the current lack of error bars, significance testing, and isolating ablations. In the revision we will (i) report mean ± standard deviation across at least three random seeds for all metrics, (ii) add paired t-test p-values for the main comparisons, and (iii) expand §4 with a full ablation that includes a step-based replay baseline equipped with the identical intensity regularizer. These additions will allow readers to assess robustness across model scales and isolate the contribution of the scheduler. revision: yes
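
The reporting promised in this response (mean ± standard deviation over seeds, plus a paired t-test) can be sketched with the standard library alone. The function names and any inputs are illustrative assumptions, not results from the paper. The sketch returns only the t statistic; turning it into a p-value needs the Student t distribution with n − 1 degrees of freedom, which a library routine such as `scipy.stats.ttest_rel` provides.

```python
import math
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return mean(scores), stdev(scores)


def paired_t(scores_a, scores_b):
    """Paired t statistic for two methods evaluated on the same seeds.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are per-seed differences.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need paired scores from at least two seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))
```

With only three seeds, as proposed, the test has two degrees of freedom, so even a large t statistic corresponds to a fairly wide uncertainty; more seeds would make the comparison against a step-based baseline more convincing.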

Circularity Check

0 steps flagged

No significant circularity; proxy and external motivation remain independent of target metrics

The derivation defines model-centric time directly as the cumulative magnitude of optimizer updates (a measurable input quantity) and applies replay intervals drawn from an Ebbinghaus functional form whose motivation is attributed to external prior findings on LLM forgetting. No equation fits a parameter to the evaluation forgetting metric and then re-uses that fit as the 'prediction' or scheduler; the intensity regularization adapts replay strength from the same proxy without tautological reduction. No self-citation is shown to be load-bearing for the central claim, and the chain does not rename or smuggle an ansatz that collapses to the input data. The reported gains therefore rest on an independent proxy rather than on constructional equivalence to the benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that LLM forgetting follows the Ebbinghaus curve shape and that update magnitude serves as a suitable time axis; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

Domain assumption: LLM forgetting mirrors the Ebbinghaus human forgetting curve.
Stated as motivation from recent findings; used to justify the replay scheduler shape.

Domain assumption: the magnitude of optimizer updates defines a faithful model-centric time.
Central to replacing step-based heuristics with update-magnitude intervals.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "FOREVER defines model time using the magnitude of optimizer updates... τ_t = Σ Δ_i where Δ_t = ||Θ_t − Θ_{t−1}||_2"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · tag: unclear
Paper passage: "Ebbinghaus-guided replay schedule on model time... D_model = {d · τ_day}"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems
cs.AI · 2026-04 · unverdicted · novelty 7.0
SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
cs.LG · 2026-05 · unverdicted · novelty 6.0
Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.

Not All Memories Age the Same: Autodiscovery of Adaptive Decay in Knowledge Graphs
cs.IR · 2026-04 · unverdicted · novelty 6.0
Knowledge graphs should use data-driven hierarchical decay surfaces based on velocity and volatility instead of uniform forgetting curves to better identify currently relevant facts.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 3 Pith papers · 1 internal anchor

[1]

InFindings of the Asso- ciation for Computational Linguistics: EMNLP 2024, pages 17167–17186

Mitigating catastrophic forgetting in language transfer via model merging. InFindings of the Asso- ciation for Computational Linguistics: EMNLP 2024, pages 17167–17186. Andrew Bai, Chih-Kuan Yeh, Cho-Jui Hsieh, and Ankur Taly. 2025. An efficient rehearsal scheme for catas- trophic forgetting mitigation during multi-stage fine- tuning.Preprint, arXiv:2402....

work page arXiv 2024
[2]

arXiv preprint arXiv:2503.01595 (2025)

LoRAMoE: Alleviating world knowledge for- getting in large language models via MoE-style plu- gin. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1932–1945, Bangkok, Thailand. Association for Computational Linguistics. Wenyu Du, Shuang Cheng, Tongxu Luo, Zihan Qiu, Zeyu Huang, Ka ...

work page arXiv 1932
[3]

Refine large language model fine-tuning via instruction vector.arXiv preprint arXiv:2406.12227, 2024

Interpretable catastrophic forgetting of large language model fine-tuning via instruction vector. arXiv preprint arXiv:2406.12227. Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, and Ying Wei

work page arXiv
[4]

Hankyul Kang, Gregor Seifer, Donghyun Lee, and Jong- bin Ryu

Unlocking the power of function vectors for characterizing and mitigating catastrophic forget- ting in continual instruction tuning.arXiv preprint arXiv:2502.11019. Hankyul Kang, Gregor Seifer, Donghyun Lee, and Jong- bin Ryu. 2025a. Do your best and get enough rest for continual learning. InProceedings of the Computer Vision and Pattern Recognition Confe...

work page arXiv 2022
[5]

Zixuan Ke, Bing Liu, Wenhan Xiong, Asli Celikyilmaz, and Haoran Li

Achieving forgetting prevention and knowl- edge transfer in continual learning.Advances in Neural Information Processing Systems, 34:22443– 22456. Zixuan Ke, Bing Liu, Wenhan Xiong, Asli Celikyilmaz, and Haoran Li. 2023. Sub-network discovery and soft-masking for continual learning of mixed tasks. arXiv preprint arXiv:2310.09436. James Kirkpatrick, Razvan...

work page arXiv 2023
[6]

InProceedings of the 2025 Conference on Empiri- cal Methods in Natural Language Processing, pages 18489–18504

Dynamic expert specialization: Towards catas- trophic forgetting-free multi-domain moe adaptation. InProceedings of the 2025 Conference on Empiri- cal Methods in Natural Language Processing, pages 18489–18504. Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, and 1 others

work page 2025
[7]

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Advances and challenges in foundation agents: From brain-inspired intelligence to evolution- ary, collaborative, and safe systems.arXiv preprint arXiv:2504.01990. Aojun Lu, Hangjie Yuan, Tao Feng, and Yanan Sun

work page internal anchor Pith review Pith/arXiv arXiv
[8]

arXiv preprint arXiv:2506.03951 , year=

Rethinking the stability-plasticity trade-off in continual learning from an architectural perspective. Preprint, arXiv:2506.03951. Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. 2024. Moelora: Contrastive learning guided mixture of experts on parameter-efficient fine-tuning for large language models.arXiv preprint arXiv:...

work page arXiv 2024
[9]

Razdaibiedina, Y

Progressive prompts: Continual learning for language models.arXiv preprint arXiv:2301.12314. Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. 2024. Analyzing and reducing catas- trophic forgetting in parameter efficient tuning.arXiv preprint arXiv:2402.18865. Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, ...

work page arXiv 2024
[10]

arXiv preprint arXiv:2403.18886 (2024) 26

Self-expansion of pre-trained models with mixture of adapters for continual learning.arXiv preprint arXiv:2403.18886. Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. 2023. Orthogonal subspace learning for lan- guage model continual learning. InFindings of the Association for Computational Linguistic...

work page arXiv 2023
[11]

task639_multi_woz_user_utterance_generation dialogue generation Rouge-L

work page
[12]

task1590_diplomacy_text_generation dialogue generation Rouge-L

work page
[13]

task1729_personachat_generate_next dialogue generation Rouge-L

work page
[14]

task181_outcome_extraction information extraction Rouge-L

work page
[15]

task748_glucose_reverse_cause_event_detection information extraction Rouge-L

work page
[16]

task1510_evalution_relation_extraction information extraction Rouge-L

work page
[17]

task002_quoref_answer_generation question answering Rouge-L

work page
[18]

task073_commonsenseqa_answer_generation question answering Rouge-L

work page
[19]

task591_sciq_answer_generation question answering Rouge-L

work page
[20]

task511_reddit_tifu_long_text_summarization summarization Rouge-L

work page
[21]

task1290_xsum_summarization summarization Rouge-L

work page
[22]

task1572_samsum_summary summarization Rouge-L

work page
[23]

task363_sst2_polarity_classification sentiment analysis accuracy

work page
[24]

task875_emotion_classification sentiment analysis accuracy

work page
[25]

Dataset name Category Task Domain Metric

task1687_sentiment140_classification sentiment analysis accuracy Table 7: The details of 15 datasets in the SuperNI Benchmark (Wang et al., 2022). Dataset name Category Task Domain Metric

work page 2022
[26]

Yelp CL Benchmark sentiment analysis Yelp reviews accuracy

work page
[27]

Amazon CL Benchmark sentiment analysis Amazon reviews accuracy

work page
[28]

DBpedia CL Benchmark topic classification Wikipedia accuracy

work page
[29]

Yahoo CL Benchmark topic classification Yahoo Q&A accuracy

work page
[30]

AG News CL Benchmark topic classification news accuracy

work page
[31]

MNLI GLUE natural language inference various accuracy

work page
[32]

QQP GLUE paragraph detection Quora accuracy

work page
[33]

RTE GLUE natural language inference news, Wikipedia accuracy

work page
[34]

SST-2 GLUE sentiment analysis movie reviews accuracy

work page
[35]

WiC SuperGLUE word sense disambiguation lexical databases accuracy

work page
[36]

CB SuperGLUE natural language inference various accuracy

work page
[37]

COPA SuperGLUE question and answering blogs, encyclopedia accuracy

work page
[38]

BoolQA SuperGLUE boolean question and answering Wikipedia accuracy

work page
[39]

MultiRC SuperGLUE question and answering various accuracy

work page
[40]

First five tasks correspond to the standard CL benchmark (Zhang et al., 2015)

IMDB SuperGLUE sentiment analysis movie reviews accuracy Table 8: The details of 15 classification datasets in the Long Sequence Benchmark (Razdai et al., 2022). First five tasks correspond to the standard CL benchmark (Zhang et al., 2015). Order Benchmark Task Sequence 1 Standard CL dbpedia→amazon→yahoo→ag 2 dbpedia→amazon→ag→yahoo 3 yahoo→amazon→ag→dbpe...

work page 2022
