Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting

Arpit Garg; Hemanth Saratchandran; Runze Xu; Simon Lucey

arxiv: 2605.29498 · v1 · pith:YCUDU5ZFnew · submitted 2026-05-28 · 💻 cs.CL · cs.CV

Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting

Runze Xu , Arpit Garg , Hemanth Saratchandran , Simon Lucey This is my paper

Pith reviewed 2026-06-29 07:34 UTC · model grok-4.3

classification 💻 cs.CL cs.CV

keywords LoRAcatastrophic forgettingregularizationlarge language modelsfine-tuningKL divergencereplay-free adaptation

0 comments

The pith

Removing the ground-truth token before KL regularization lets LoRA adapt new tasks while better preserving base model preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard LoRA fine-tuning degrades prior capabilities especially when the adaptation data differs from the original training or alignment distribution, and that original data is usually unavailable for replay. It proposes a regularizer applied only at the loss level that drops the ground-truth token from both the base and adapted output distributions, renormalizes the remaining probabilities, and computes KL divergence solely over the non-target vocabulary. This setup is intended to keep the base model's relative ordering among alternatives while the cross-entropy term still drives adaptation on the target. Experiments across LoRA variants and backbones indicate the approach shifts the observed trade-off toward less forgetting without requiring architectural changes or replay buffers.

Core claim

By excluding the ground-truth token from both distributions, renormalizing the rest, and restricting KL regularization to the non-target vocabulary, the method maintains the base model's relative preferences among alternative tokens and thereby reduces forgetting during LoRA adaptation on distributions that differ substantially from the original training data.

What carries the argument

Target-masked KL regularizer: drops the ground-truth token from base and adapted softmax outputs, renormalizes the remaining probabilities, and applies KL only over those non-target entries.

If this is right

The regularizer improves the new-learning versus forgetting frontier across tested LoRA variants and model backbones when adaptation distributions differ substantially from pretraining.
No replay data, model architecture changes, or inference-time cost is required because the regularizer operates only at the loss level.
The same plug-in can be added to any existing LoRA training pipeline without redesigning adapters.
Forgetting reduction holds in replay-free settings where original training or alignment data cannot be accessed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result implies that direct opposition on the target token itself is a primary source of forgetting in standard output-space regularization.
The masking step could be tested on full fine-tuning or other parameter-efficient methods to check whether the same output-space intervention generalizes.
Combining the regularizer with limited replay when some original data is available might produce additive gains, though the paper does not examine this case.
The approach suggests that lightweight output interventions can be a practical lever for controlling knowledge retention during LLM updates.

Load-bearing premise

Removing the ground-truth token and renormalizing leaves the base model's relative preferences among the remaining tokens unchanged and does not oppose the adaptation signal from cross-entropy loss.

What would settle it

On a benchmark with highly divergent adaptation data, compare base-model retention metrics (for example, performance on held-out original tasks) between standard LoRA and the regularized version; if the regularized version shows no improvement or worse retention, the claim is falsified.

Figures

Figures reproduced from arXiv: 2605.29498 by Arpit Garg, Hemanth Saratchandran, Runze Xu, Simon Lucey.

**Figure 1.** Figure 1: Overview of Target-Masked KL regularization. At each supervised token position, the frozen base model produces a next-token distribution pbase over the vocabulary, and the LoRAadapted model produces a next-token distribution padapted for the same context. Cross-entropy is computed on the supervised target token y exactly as in standard LoRA fine-tuning. Target-Masked KL adds a second term: it removes the … view at source ↗

**Figure 2.** Figure 2: Headline result on Qwen2.5-0.5B → OpenR1-Math. Mean over three seeds. Baseline (CE) is plain cross-entropy fine-tuning, the standard LoRA training objective; CE + TMKL adds Target-Masked KL (λ=1) to the same training run. Grey bars are the baseline; coloured bars are TMKL. (a) Target adaptation: change in target perplexity after adaptation; more negative is better learning. TMKL improves target adaptation … view at source ↗

**Figure 3.** Figure 3: Two ablations on Qwen2.5-0.5B → OpenR1-Math (single seed). (a) Drift-prevention is monotone in λ for both LoRA (blue) and SineLoRA (red); the curves overlap to within a few pp, so the shape is a property of the loss not the adapter. We use λ=1 throughout. (b) Held-out non-target KL between base and adapted next-token distributions on the OpenR1-Math test split. CE (grey) pushes the distance to 0.50 to 0.81… view at source ↗

read the original abstract

Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA updates may improve performance on the target distribution while degrading prior capabilities learned during pretraining and alignment. We show that this forgetting becomes especially severe when the adaptation distribution differs substantially from the models original training or alignment distributions. The challenge is amplified in practical settings, where the original training and alignment data are typically unavailable. Motivated by this constraint, we study how LoRA based adaptation balances new learning against forgetting in a replay-free setting, and introduce a simple output space regularizer that can be added directly to existing training pipelines. Our method removes the ground-truth token from both the base and adapted model distributions, renormalizes the remaining probabilities, and applies KL regularization only over the non-target vocabulary. This preserves the base models relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation. As the regularizer acts only at the loss level, it requires no replay data, architectural changes, adapter redesign, or inference-time overhead, and can be applied directly to existing LoRA variants. Across all LoRA variants tested and across various backbones, our method improves the frontier between new learning and forgetting when the adaptation distribution differs substantially from the base models original training or alignment distributions, suggesting a broadly applicable route toward more reliable LLM updating.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The target-masking KL regularizer is a straightforward plug-in for LoRA but the claim of no interference with the cross-entropy gradient rests on an unverified assumption.

read the letter

The paper introduces a specific output-space regularizer: drop the ground-truth token from both base and adapted distributions, renormalize the rest, and run KL only on the non-target tokens. This combination is new enough to stand as a distinct method for replay-free LoRA adaptation.

It does one thing cleanly: it adds to any existing LoRA training loop with no extra data, no architecture changes, and no inference cost. The abstract reports gains on the learning-forgetting trade-off across several LoRA variants and backbones when the adaptation data is far from the original distribution. That matches a real deployment pain point.

The main soft spot is the gradient coupling the stress-test note flags. Renormalization makes every q_a(v) depend on p_a(t), so the KL term has a nonzero derivative with respect to the target logit. The paper states the construction “preserves the base model’s relative preferences among alternatives without directly opposing the cross-entropy signal,” yet the abstract supplies no derivative expansion or ablation that isolates this effect. If the full paper contains that analysis or shows the interference is negligible in practice, the claim holds; otherwise it is an open question.

The rest of the setup looks standard and non-circular. No invented metrics or self-referential evaluation.

This is for people who fine-tune LLMs with LoRA and care about capability retention when data is scarce. It is worth sending to peer review so the gradient question and the experimental controls can be checked properly.

Referee Report

2 major / 0 minor

Summary. The paper proposes a plug-and-play output-space regularizer for LoRA fine-tuning of LLMs. The method masks the ground-truth target token from the base and adapted next-token distributions, renormalizes the remaining probabilities, and applies KL divergence only over the non-target vocabulary. This is claimed to mitigate catastrophic forgetting of pretraining and alignment capabilities in a replay-free setting, improving the learning-forgetting frontier when adaptation data differs substantially from the original distributions. The regularizer requires no architectural changes, replay data, or inference overhead and is asserted to act without directly opposing the cross-entropy adaptation signal.

Significance. If the empirical claims hold after verification, the approach supplies a lightweight, broadly applicable loss-level addition to existing LoRA pipelines that addresses a common practical failure mode in LLM adaptation without replay or redesign. The simplicity and compatibility with multiple LoRA variants and backbones would make it a useful contribution to reliable model updating.

major comments (2)

[Abstract] Abstract (method description): The claim that the regularizer 'preserves the base model’s relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation' lacks supporting analysis. Defining q_b(v) = p_b(v)/(1-p_b(t)) and q_a(v) = p_a(v)/(1-p_a(t)) for v ≠ t makes the KL(q_b || q_a) term an explicit function of p_a(t); its gradient with respect to the target logit is therefore nonzero in general and can couple to the CE loss. No derivative expansion or ablation isolating this effect is referenced.
[Abstract] Abstract (empirical claim): The statement that the method 'improves the frontier between new learning and forgetting' across 'all LoRA variants tested and across various backbones' is presented without reference to statistical significance, number of random seeds, hyperparameter controls, or ablation isolating the regularizer from other training choices. These details are load-bearing for the central empirical claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify areas where additional rigor can strengthen the presentation of both the method and the empirical results. We address each major comment below and will incorporate the requested clarifications and analyses in the revised manuscript.

read point-by-point responses

Referee: [Abstract] Abstract (method description): The claim that the regularizer 'preserves the base model’s relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation' lacks supporting analysis. Defining q_b(v) = p_b(v)/(1-p_b(t)) and q_a(v) = p_a(v)/(1-p_a(t)) for v ≠ t makes the KL(q_b || q_a) term an explicit function of p_a(t); its gradient with respect to the target logit is therefore nonzero in general and can couple to the CE loss. No derivative expansion or ablation isolating this effect is referenced.

Authors: We acknowledge the referee's observation that the normalization factor introduces a dependence on p_a(t), so the gradient of the KL term w.r.t. the target logit is nonzero. The original phrasing intended to convey that the regularizer operates on the conditional distribution over non-target tokens and therefore does not directly penalize increases in the target probability (the primary adaptation signal remains the CE loss). Nevertheless, we agree that a precise characterization of the coupling is needed. In the revision we will add (i) an explicit derivative expansion of the regularizer gradient w.r.t. the target logit and (ii) an ablation that isolates the contribution of this term from the CE signal. revision: yes
Referee: [Abstract] Abstract (empirical claim): The statement that the method 'improves the frontier between new learning and forgetting' across 'all LoRA variants tested and across various backbones' is presented without reference to statistical significance, number of random seeds, hyperparameter controls, or ablation isolating the regularizer from other training choices. These details are load-bearing for the central empirical claim.

Authors: We agree that the central empirical claim requires stronger statistical grounding. In the revised manuscript we will report results aggregated over multiple random seeds, include statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values), document the hyperparameter search protocol and controls, and add an ablation that isolates the regularizer from other training decisions such as learning-rate schedules and LoRA rank choices. revision: yes

Circularity Check

0 steps flagged

No circularity: regularizer defined independently; empirical claims not reduced to inputs by construction

full rationale

The paper defines its output-space regularizer explicitly by masking the ground-truth token, renormalizing the remaining distribution, and applying KL divergence only on non-target tokens. This construction is stated as an independent addition to the cross-entropy loss and does not reference or depend on the downstream evaluation metrics for new learning or forgetting. No derivation chain equates a claimed result to a fitted parameter or self-citation; the improvement is presented as an empirical observation across LoRA variants rather than a mathematical identity. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the text. The central premise therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are identifiable from the provided text. The approach relies on standard probability renormalization and KL divergence.

pith-pipeline@v0.9.1-grok · 5806 in / 1066 out tokens · 19055 ms · 2026-06-29T07:34:17.038851+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 43 canonical work pages · 11 internal anchors

[1]

Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors

Imtiaz Ahmed, Sadman Islam, Partha Protim Datta, Imran Kabir, Md Naseef Ur Rahman Chowdhury, and Ahshanul Haque. Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors. 2025

2025
[2]

Zhang, Hemanth Saratchandran, Anton van den Hengel, and Ehsan Abbasnejad

Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Anton van den Hengel, and Ehsan Abbasnejad. Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri–Rao Product, August 2025. URLhttp://arxiv.org/abs/2508.00230. arXiv:2508.00230 [cs]

work page arXiv 2025
[3]

Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, and Ehsan Abbasnejad

Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, and Ehsan Abbasnejad. RandLoRA: Full-rank parameter-efficient fine-tuning of large models, March 2025. URLhttp://arxiv.org/abs/2502.00987. arXiv:2502.00987 [cs]

work page arXiv 2025
[4]

PLD: A Choice-Theoretic List-Wise Knowledge Distillation,

Ejafa Bassam, Dawei Zhu, and Kaigui Bian. PLD: A Choice-Theoretic List-Wise Knowledge Distillation,
[5]

PLD: A Choice-Theoretic List-Wise Knowledge Distillation

URLhttps://arxiv.org/abs/2506.12542. Version Number: 3

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models.arXiv preprint arXiv:2106.10199, 2021

Elad Ben-Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models, September 2022. URL http://arxiv.org/abs/2106. 10199. arXiv:2106.10199 [cs]

work page arXiv 2022
[7]

Cunningham

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John P. Cunningham. LoRA Learns Less and Forgets Less, September 2024. URL http://arxiv.org/abs/ 2405.09673. arXiv:2405.09673 [cs] version: 2

work page arXiv 2024
[8]

Dark Experience for General Continual Learning: a Strong, Simple Baseline, October 2020

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark Experience for General Continual Learning: a Strong, Simple Baseline, October 2020. URL http://arxiv.org/ abs/2004.07211. arXiv:2004.07211 [stat]

work page arXiv 2020
[9]

Efficient Lifelong Learning with A-GEM

Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient Lifelong Learning with A-GEM, January 2019. URL http://arxiv.org/abs/1812.00420. arXiv:1812.00420 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[10]

ccdv/pubmed: Pubmed abstracts and articles

Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli. ccdv/pubmed: Pubmed abstracts and articles. https: //huggingface.co/datasets/ccdv/pubmed-summarization, 2018. HuggingFace mirror of the PubMed long-document summarisation corpus used here as a biomedical adaptation target

2018
[11]

John Wiley & Sons, 1999

Thomas M Cover.Elements of information theory. John Wiley & Sons, 1999

1999
[12]

Mistral-splade: Llms for better learned sparse retrieval.arXiv preprint arXiv:2408.11119, 2024

Meet Doshi, Vishwajeet Kumar, Rudra Murthy, Jaydeep Sen, et al. Mistral-splade: Llms for better learned sparse retrieval.arXiv preprint arXiv:2408.11119, 2024

work page arXiv 2024
[13]

Otaduy, and Dan Casas

Arthur Douillard, Alexandre Rame, Guillaume Couairon, and Matthieu Cord. DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9275–9285, New Orleans, LA, USA, June 2022. IEEE. ISBN 978-1-6654-6946-3. doi: 10.1109/CVPR52688.2022.00907. URL https://ieeexp...

work page doi:10.1109/cvpr52688.2022.00907 2022
[14]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

break-fix

Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, Atabak Ashfaq, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, et al. Phi-3 safety post-training: Aligning language models with a "break-fix" cycle.arXiv preprint arXiv:2407.13833, 2024

work page arXiv 2024
[16]

CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning, May 2025

Jiangpeng He, Zhihao Duan, and Fengqing Zhu. CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning, May 2025. URL http://arxiv.org/abs/2505.24816. arXiv:2505.24816 [cs]. 10

work page arXiv 2025
[17]

Distilling the Knowledge in a Neural Network, March

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network, March
[18]

Distilling the Knowledge in a Neural Network

URLhttp://arxiv.org/abs/1503.02531. arXiv:1503.02531 [stat]

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Parameter-Efficient Transfer Learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP, June 2019. URLhttp://arxiv.org/abs/1902.00751. arXiv:1902.00751 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[20]

STABLE: Gated continual learning for large language models.arXiv preprint arXiv:2510.16089, 2025

William Hoy and Nurcin Celik. STABLE: Gated continual learning for large language models.arXiv preprint arXiv:2510.16089, 2025

work page arXiv 2025
[21]

Safe LoRA: The silver lining of reducing safety risks when finetuning large language models

Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Safe LoRA: The silver lining of reducing safety risks when finetuning large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[22]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, October 2021. URL http://arxiv.org/abs/2106.09685. arXiv:2106.09685 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

LoraHub: Efficient cross-task generalization via dynamic lora composition.arXiv preprint arXiv:2307.13269, 2023

Chengsong Huang, Qian Liu, Min Lin, et al. LoraHub: Efficient cross-task generalization via dynamic lora composition.arXiv preprint arXiv:2307.13269, 2023

work page arXiv 2023
[24]

Knowledge Distillation from A Stronger Teacher, December 2022

Tao Huang, Shan You, Fei Wang, Chen Qian, and Chang Xu. Knowledge Distillation from A Stronger Teacher, December 2022. URLhttp://arxiv.org/abs/2205.10536. arXiv:2205.10536 [cs]

work page arXiv 2022
[25]

OpenR1-Math-220k: Math reasoning dataset

HuggingFace Open-R1 Team. OpenR1-Math-220k: Math reasoning dataset. https://huggingface. co/datasets/open-r1/OpenR1-Math-220k , 2025. Released January 2025; post-Qwen2.5-cutoff math- reasoning corpus

2025
[26]

Efficient learning with sine-activated low-rank matrices, 2025

Yiping Ji, Hemanth Saratchandran, Cameron Gordon, Zeyu Zhang, and Simon Lucey. Efficient learning with sine-activated low-rank matrices, 2025. URLhttps://arxiv.org/abs/2403.19243

work page arXiv 2025
[27]

A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Clau- dia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural net- works.Proceedings of the National Academy of Sciences, 114(13):3...

work page doi:10.1073/pnas.1611835114 2017
[28]

Soroush Abbasi Koohpayegani, K. L. Navaneet, Parsa Nooralinejad, Soheil Kolouri, and Hamed Pirsiavash. NOLA: Compressing LoRA using Linear Combination of Random Basis, April 2024. URL http: //arxiv.org/abs/2310.02556. arXiv:2310.02556 [cs]

work page arXiv 2024
[29]

Kopiczko, Tijmen Blankevoort, and Yuki M

Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based Random Matrix Adaptation, January 2024. URLhttp://arxiv.org/abs/2310.11454. arXiv:2310.11454 [cs]

work page arXiv 2024
[30]

The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning, September 2021. URLhttp://arxiv.org/abs/2104.08691. arXiv:2104.08691 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[31]

NTCE-KD: Non-target-class-enhanced knowledge distillation.Sensors, 24(11):3617, 2024

Chuan Li, Xiao Teng, Yan Ding, and Long Lan. NTCE-KD: Non-target-class-enhanced knowledge distillation.Sensors, 24(11):3617, 2024

2024
[32]

SaLoRA: Safety-alignment preserved low-rank adaptation

Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, and Yisen Wang. SaLoRA: Safety-alignment preserved low-rank adaptation. InInternational Conference on Learning Representations (ICLR), 2025

2025
[33]

Prefix-Tuning: Optimizing Continuous Prompts for Generation, January

Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation, January
[34]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

URLhttp://arxiv.org/abs/2101.00190. arXiv:2101.00190 [cs]

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Learning without Forgetting

Zhizhong Li and Derek Hoiem. Learning without Forgetting, February 2017. URL http://arxiv.org/ abs/1606.09282. arXiv:1606.09282 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

InfLoRA: Interference-free low-rank adaptation for continual learning

Yan-Shuo Liang and Wu-Jun Li. InfLoRA: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[37]

DoRA: Weight-Decomposed Low-Rank Adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-Decomposed Low-Rank Adaptation, July 2024. URL http://arxiv.org/abs/2402.09353. arXiv:2402.09353 [cs]. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Controlled low-rank adaptation with subspace regularization for continued training on large language models

Yuheng Lu, Bingshuo Qian, Caixia Yuan, Huixing Jiang, and Xiaojie Wang. Controlled low-rank adaptation with subspace regularization for continued training on large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19165– 19181, 2025

2025
[39]

PEFT: State-of-the-art parameter-efficient fine-tuning

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning. https://github.com/huggingface/peft, 2022

2022
[40]

Michael McCloskey and Neal J. Cohen. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. InPsychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989. ISBN 978-0-12-543324-2. doi: 10.1016/S0079-7421(08)60536-8. URL https:// linkinghub.elsevier.com/retrieve/pii/S0079742108605368

work page doi:10.1016/s0079-7421(08)60536-8 1989
[41]

Pissa: Principal singular values and singular vectors adaptation of large language models.arXiv preprint arXiv:2404.02948, 2024

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. PiSSA: Principal Singular Values and Singular Vec- tors Adaptation of Large Language Models, April 2025. URL http://arxiv.org/abs/2404.02948. arXiv:2404.02948 [cs]

work page arXiv 2025
[42]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017

2017
[43]

Learning from the undesirable: Robust adaptation of language models without forgetting

Yunhun Nam, Jaehyung Kim, and Jongheon Jeong. Learning from the undesirable: Robust adaptation of language models without forgetting. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 32537–32545, 2026

2026
[44]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016

2016
[45]

LFPT5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5

Chengwei Qin and Shafiq Joty. LFPT5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. InInternational Conference on Learning Representations (ICLR), 2022

2022
[46]

Tahaei, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, and Mehdi Rezagholizadeh

Hossein Rajabzadeh, Mojtaba Valipour, Tianshu Zhu, Marzieh S. Tahaei, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, and Mehdi Rezagholizadeh. QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 712–718, Miami, Flo...

work page doi:10.18653/v1/2024.emnlp-industry.53 2024
[47]

The effectiveness of approximate regularized replay for efficient supervised fine-tuning of large language models.arXiv preprint arXiv:2512.22337, 2025

Matthew Riemer, Erik Miehling, Miao Liu, Djallel Bouneffouf, and Murray Campbell. The effectiveness of approximate regularized replay for efficient supervised fine-tuning of large language models.arXiv preprint arXiv:2512.22337, 2025

work page arXiv 2025
[48]

LoRA vs Full Fine- tuning: An Illusion of Equivalence, October 2025

Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs Full Fine- tuning: An Illusion of Equivalence, October 2025. URL http://arxiv.org/abs/2410.21228. arXiv:2410.21228 [cs]

work page arXiv 2025
[49]

Logit Standardization in Knowledge Distillation, March 2024

Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, and Xiaochun Cao. Logit Standardization in Knowledge Distillation, March 2024. URLhttp://arxiv.org/abs/2403.01427. arXiv:2403.01427 [cs]

work page arXiv 2024
[50]

Titsias, Jonathan Schwarz, Alexander G

Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, and Yee Whye Teh. Functional Regularisation for Continual Learning with Gaussian Processes, February 2020. URL http://arxiv.org/abs/1901.11356. arXiv:1901.11356 [stat]

work page arXiv 2020
[51]

DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation, April 2023

Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation, April 2023. URL http://arxiv.org/abs/2210.07558. arXiv:2210.07558 [cs]

work page arXiv 2023
[52]

CoT-VLA: Visual chain-of-thought reasoning for vision- language-action models,

Huiyi Wang, Haodong Lu, Lina Yao, and Dong Gong. Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10087–10098, June 2025. doi: 10.1109/CVPR52734.2025.00943. URL https://ieeexplore.ieee.org/document/11093725/. ISSN: 2575-7075

work page doi:10.1109/cvpr52734.2025.00943 2025
[53]

A Comprehensive Survey of Continual Learn- ing: Theory, Method and Application, February 2024

Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A Comprehensive Survey of Continual Learn- ing: Theory, Method and Application, February 2024. URL http://arxiv.org/abs/2302.00487. arXiv:2302.00487 [cs]. 12

work page arXiv 2024
[54]

Orthogonal Subspace Learning for Language Model Continual Learning, October 2023

Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal Subspace Learning for Language Model Continual Learning, October 2023. URL http://arxiv.org/abs/2310.14152. arXiv:2310.14152 [cs]

work page arXiv 2023
[55]

TIES-merging: Resolving interference when merging models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-merging: Resolving interference when merging models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[56]

Otaduy, and Dan Casas

Shipeng Yan, Jiangwei Xie, and Xuming He. DER: Dynamically Expandable Representation for Class Incremental Learning. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3013–3022, Nashville, TN, USA, June 2021. IEEE. ISBN 978-1-6654-4509-2. doi: 10.1109/CVPR46437.2021.00303. URLhttps://ieeexplore.ieee.org/document/9578633/

work page doi:10.1109/cvpr46437.2021.00303 2021
[57]

V*: Guided visual search as a core mechanism in multimodal llms

Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23219–23230, Seattle, WA, USA, June 2024. IEEE. ISBN 979-8-3503-5300-6. doi: 10.1109/CVPR52733.2024.02191. ...

work page doi:10.1109/cvpr52733.2024.02191 2024
[58]

Language models are super mario: Absorbing abilities from homologous models as a free lunch.arXiv preprint arXiv:2311.03099, 2023

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch.arXiv preprint arXiv:2311.03099, 2023

work page arXiv 2023
[59]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning, December 2023. URLhttp://arxiv.org/abs/2303.10512. arXiv:2303.10512 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[60]

C-LoRA: Continual low-rank adaptation for pre-trained models.arXiv preprint arXiv:2502.17920, 2025

Xin Zhang, Liang Bai, Xian Yang, and Jiye Liang. C-LoRA: Continual low-rank adaptation for pre-trained models.arXiv preprint arXiv:2502.17920, 2025

work page arXiv 2025
[61]

Decoupled Knowledge Distillation, July

Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled Knowledge Distillation, July
[62]

undesirable

URLhttp://arxiv.org/abs/2203.08679. arXiv:2203.08679 [cs]. 13 Supplementary Material A Extended Related Work The main-text related work (§2) condenses the literature into three paragraphs. This appendix expands the discussion for readers who want a more complete picture of the LoRA, replay-free continual learning, and output-space distillation literatures...

work page arXiv 2024
[63]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

2026

[1] [1]

Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors

Imtiaz Ahmed, Sadman Islam, Partha Protim Datta, Imran Kabir, Md Naseef Ur Rahman Chowdhury, and Ahshanul Haque. Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors. 2025

2025

[2] [2]

Zhang, Hemanth Saratchandran, Anton van den Hengel, and Ehsan Abbasnejad

Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Anton van den Hengel, and Ehsan Abbasnejad. Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri–Rao Product, August 2025. URLhttp://arxiv.org/abs/2508.00230. arXiv:2508.00230 [cs]

work page arXiv 2025

[3] [3]

Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, and Ehsan Abbasnejad

Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Cristian Rodriguez-Opazo, Anton van den Hengel, and Ehsan Abbasnejad. RandLoRA: Full-rank parameter-efficient fine-tuning of large models, March 2025. URLhttp://arxiv.org/abs/2502.00987. arXiv:2502.00987 [cs]

work page arXiv 2025

[4] [4]

PLD: A Choice-Theoretic List-Wise Knowledge Distillation,

Ejafa Bassam, Dawei Zhu, and Kaigui Bian. PLD: A Choice-Theoretic List-Wise Knowledge Distillation,

[5] [5]

PLD: A Choice-Theoretic List-Wise Knowledge Distillation

URLhttps://arxiv.org/abs/2506.12542. Version Number: 3

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models.arXiv preprint arXiv:2106.10199, 2021

Elad Ben-Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models, September 2022. URL http://arxiv.org/abs/2106. 10199. arXiv:2106.10199 [cs]

work page arXiv 2022

[7] [7]

Cunningham

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John P. Cunningham. LoRA Learns Less and Forgets Less, September 2024. URL http://arxiv.org/abs/ 2405.09673. arXiv:2405.09673 [cs] version: 2

work page arXiv 2024

[8] [8]

Dark Experience for General Continual Learning: a Strong, Simple Baseline, October 2020

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark Experience for General Continual Learning: a Strong, Simple Baseline, October 2020. URL http://arxiv.org/ abs/2004.07211. arXiv:2004.07211 [stat]

work page arXiv 2020

[9] [9]

Efficient Lifelong Learning with A-GEM

Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient Lifelong Learning with A-GEM, January 2019. URL http://arxiv.org/abs/1812.00420. arXiv:1812.00420 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019

[10] [10]

ccdv/pubmed: Pubmed abstracts and articles

Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli. ccdv/pubmed: Pubmed abstracts and articles. https: //huggingface.co/datasets/ccdv/pubmed-summarization, 2018. HuggingFace mirror of the PubMed long-document summarisation corpus used here as a biomedical adaptation target

2018

[11] [11]

John Wiley & Sons, 1999

Thomas M Cover.Elements of information theory. John Wiley & Sons, 1999

1999

[12] [12]

Mistral-splade: Llms for better learned sparse retrieval.arXiv preprint arXiv:2408.11119, 2024

Meet Doshi, Vishwajeet Kumar, Rudra Murthy, Jaydeep Sen, et al. Mistral-splade: Llms for better learned sparse retrieval.arXiv preprint arXiv:2408.11119, 2024

work page arXiv 2024

[13] [13]

Otaduy, and Dan Casas

Arthur Douillard, Alexandre Rame, Guillaume Couairon, and Matthieu Cord. DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9275–9285, New Orleans, LA, USA, June 2022. IEEE. ISBN 978-1-6654-6946-3. doi: 10.1109/CVPR52688.2022.00907. URL https://ieeexp...

work page doi:10.1109/cvpr52688.2022.00907 2022

[14] [14]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

break-fix

Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, Atabak Ashfaq, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, et al. Phi-3 safety post-training: Aligning language models with a "break-fix" cycle.arXiv preprint arXiv:2407.13833, 2024

work page arXiv 2024

[16] [16]

CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning, May 2025

Jiangpeng He, Zhihao Duan, and Fengqing Zhu. CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning, May 2025. URL http://arxiv.org/abs/2505.24816. arXiv:2505.24816 [cs]. 10

work page arXiv 2025

[17] [17]

Distilling the Knowledge in a Neural Network, March

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network, March

[18] [18]

Distilling the Knowledge in a Neural Network

URLhttp://arxiv.org/abs/1503.02531. arXiv:1503.02531 [stat]

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Parameter-Efficient Transfer Learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP, June 2019. URLhttp://arxiv.org/abs/1902.00751. arXiv:1902.00751 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019

[20] [20]

STABLE: Gated continual learning for large language models.arXiv preprint arXiv:2510.16089, 2025

William Hoy and Nurcin Celik. STABLE: Gated continual learning for large language models.arXiv preprint arXiv:2510.16089, 2025

work page arXiv 2025

[21] [21]

Safe LoRA: The silver lining of reducing safety risks when finetuning large language models

Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Safe LoRA: The silver lining of reducing safety risks when finetuning large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[22] [22]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, October 2021. URL http://arxiv.org/abs/2106.09685. arXiv:2106.09685 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021

[23] [23]

LoraHub: Efficient cross-task generalization via dynamic lora composition.arXiv preprint arXiv:2307.13269, 2023

Chengsong Huang, Qian Liu, Min Lin, et al. LoraHub: Efficient cross-task generalization via dynamic lora composition.arXiv preprint arXiv:2307.13269, 2023

work page arXiv 2023

[24] [24]

Knowledge Distillation from A Stronger Teacher, December 2022

Tao Huang, Shan You, Fei Wang, Chen Qian, and Chang Xu. Knowledge Distillation from A Stronger Teacher, December 2022. URLhttp://arxiv.org/abs/2205.10536. arXiv:2205.10536 [cs]

work page arXiv 2022

[25] [25]

OpenR1-Math-220k: Math reasoning dataset

HuggingFace Open-R1 Team. OpenR1-Math-220k: Math reasoning dataset. https://huggingface. co/datasets/open-r1/OpenR1-Math-220k , 2025. Released January 2025; post-Qwen2.5-cutoff math- reasoning corpus

2025

[26] [26]

Efficient learning with sine-activated low-rank matrices, 2025

Yiping Ji, Hemanth Saratchandran, Cameron Gordon, Zeyu Zhang, and Simon Lucey. Efficient learning with sine-activated low-rank matrices, 2025. URLhttps://arxiv.org/abs/2403.19243

work page arXiv 2025

[27] [27]

A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Clau- dia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural net- works.Proceedings of the National Academy of Sciences, 114(13):3...

work page doi:10.1073/pnas.1611835114 2017

[28] [28]

Soroush Abbasi Koohpayegani, K. L. Navaneet, Parsa Nooralinejad, Soheil Kolouri, and Hamed Pirsiavash. NOLA: Compressing LoRA using Linear Combination of Random Basis, April 2024. URL http: //arxiv.org/abs/2310.02556. arXiv:2310.02556 [cs]

work page arXiv 2024

[29] [29]

Kopiczko, Tijmen Blankevoort, and Yuki M

Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based Random Matrix Adaptation, January 2024. URLhttp://arxiv.org/abs/2310.11454. arXiv:2310.11454 [cs]

work page arXiv 2024

[30] [30]

The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning, September 2021. URLhttp://arxiv.org/abs/2104.08691. arXiv:2104.08691 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021

[31] [31]

NTCE-KD: Non-target-class-enhanced knowledge distillation.Sensors, 24(11):3617, 2024

Chuan Li, Xiao Teng, Yan Ding, and Long Lan. NTCE-KD: Non-target-class-enhanced knowledge distillation.Sensors, 24(11):3617, 2024

2024

[32] [32]

SaLoRA: Safety-alignment preserved low-rank adaptation

Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, and Yisen Wang. SaLoRA: Safety-alignment preserved low-rank adaptation. InInternational Conference on Learning Representations (ICLR), 2025

2025

[33] [33]

Prefix-Tuning: Optimizing Continuous Prompts for Generation, January

Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation, January

[34] [34]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

URLhttp://arxiv.org/abs/2101.00190. arXiv:2101.00190 [cs]

work page internal anchor Pith review Pith/arXiv arXiv

[35] [35]

Learning without Forgetting

Zhizhong Li and Derek Hoiem. Learning without Forgetting, February 2017. URL http://arxiv.org/ abs/1606.09282. arXiv:1606.09282 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2017

[36] [36]

InfLoRA: Interference-free low-rank adaptation for continual learning

Yan-Shuo Liang and Wu-Jun Li. InfLoRA: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[37] [37]

DoRA: Weight-Decomposed Low-Rank Adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-Decomposed Low-Rank Adaptation, July 2024. URL http://arxiv.org/abs/2402.09353. arXiv:2402.09353 [cs]. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Controlled low-rank adaptation with subspace regularization for continued training on large language models

Yuheng Lu, Bingshuo Qian, Caixia Yuan, Huixing Jiang, and Xiaojie Wang. Controlled low-rank adaptation with subspace regularization for continued training on large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19165– 19181, 2025

2025

[39] [39]

PEFT: State-of-the-art parameter-efficient fine-tuning

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning. https://github.com/huggingface/peft, 2022

2022

[40] [40]

Michael McCloskey and Neal J. Cohen. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. InPsychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989. ISBN 978-0-12-543324-2. doi: 10.1016/S0079-7421(08)60536-8. URL https:// linkinghub.elsevier.com/retrieve/pii/S0079742108605368

work page doi:10.1016/s0079-7421(08)60536-8 1989

[41] [41]

Pissa: Principal singular values and singular vectors adaptation of large language models.arXiv preprint arXiv:2404.02948, 2024

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. PiSSA: Principal Singular Values and Singular Vec- tors Adaptation of Large Language Models, April 2025. URL http://arxiv.org/abs/2404.02948. arXiv:2404.02948 [cs]

work page arXiv 2025

[42] [42]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017

2017

[43] [43]

Learning from the undesirable: Robust adaptation of language models without forgetting

Yunhun Nam, Jaehyung Kim, and Jongheon Jeong. Learning from the undesirable: Robust adaptation of language models without forgetting. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 32537–32545, 2026

2026

[44] [44]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016

2016

[45] [45]

LFPT5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5

Chengwei Qin and Shafiq Joty. LFPT5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. InInternational Conference on Learning Representations (ICLR), 2022

2022

[46] [46]

Tahaei, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, and Mehdi Rezagholizadeh

Hossein Rajabzadeh, Mojtaba Valipour, Tianshu Zhu, Marzieh S. Tahaei, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, and Mehdi Rezagholizadeh. QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 712–718, Miami, Flo...

work page doi:10.18653/v1/2024.emnlp-industry.53 2024

[47] [47]

The effectiveness of approximate regularized replay for efficient supervised fine-tuning of large language models.arXiv preprint arXiv:2512.22337, 2025

Matthew Riemer, Erik Miehling, Miao Liu, Djallel Bouneffouf, and Murray Campbell. The effectiveness of approximate regularized replay for efficient supervised fine-tuning of large language models.arXiv preprint arXiv:2512.22337, 2025

work page arXiv 2025

[48] [48]

LoRA vs Full Fine- tuning: An Illusion of Equivalence, October 2025

Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs Full Fine- tuning: An Illusion of Equivalence, October 2025. URL http://arxiv.org/abs/2410.21228. arXiv:2410.21228 [cs]

work page arXiv 2025

[49] [49]

Logit Standardization in Knowledge Distillation, March 2024

Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, and Xiaochun Cao. Logit Standardization in Knowledge Distillation, March 2024. URLhttp://arxiv.org/abs/2403.01427. arXiv:2403.01427 [cs]

work page arXiv 2024

[50] [50]

Titsias, Jonathan Schwarz, Alexander G

Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, and Yee Whye Teh. Functional Regularisation for Continual Learning with Gaussian Processes, February 2020. URL http://arxiv.org/abs/1901.11356. arXiv:1901.11356 [stat]

work page arXiv 2020

[51] [51]

DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation, April 2023

Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation, April 2023. URL http://arxiv.org/abs/2210.07558. arXiv:2210.07558 [cs]

work page arXiv 2023

[52] [52]

CoT-VLA: Visual chain-of-thought reasoning for vision- language-action models,

Huiyi Wang, Haodong Lu, Lina Yao, and Dong Gong. Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10087–10098, June 2025. doi: 10.1109/CVPR52734.2025.00943. URL https://ieeexplore.ieee.org/document/11093725/. ISSN: 2575-7075

work page doi:10.1109/cvpr52734.2025.00943 2025

[53] [53]

A Comprehensive Survey of Continual Learn- ing: Theory, Method and Application, February 2024

Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A Comprehensive Survey of Continual Learn- ing: Theory, Method and Application, February 2024. URL http://arxiv.org/abs/2302.00487. arXiv:2302.00487 [cs]. 12

work page arXiv 2024

[54] [54]

Orthogonal Subspace Learning for Language Model Continual Learning, October 2023

Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal Subspace Learning for Language Model Continual Learning, October 2023. URL http://arxiv.org/abs/2310.14152. arXiv:2310.14152 [cs]

work page arXiv 2023

[55] [55]

TIES-merging: Resolving interference when merging models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. TIES-merging: Resolving interference when merging models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[56] [56]

Otaduy, and Dan Casas

Shipeng Yan, Jiangwei Xie, and Xuming He. DER: Dynamically Expandable Representation for Class Incremental Learning. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3013–3022, Nashville, TN, USA, June 2021. IEEE. ISBN 978-1-6654-4509-2. doi: 10.1109/CVPR46437.2021.00303. URLhttps://ieeexplore.ieee.org/document/9578633/

work page doi:10.1109/cvpr46437.2021.00303 2021

[57] [57]

V*: Guided visual search as a core mechanism in multimodal llms

Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23219–23230, Seattle, WA, USA, June 2024. IEEE. ISBN 979-8-3503-5300-6. doi: 10.1109/CVPR52733.2024.02191. ...

work page doi:10.1109/cvpr52733.2024.02191 2024

[58] [58]

Language models are super mario: Absorbing abilities from homologous models as a free lunch.arXiv preprint arXiv:2311.03099, 2023

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch.arXiv preprint arXiv:2311.03099, 2023

work page arXiv 2023

[59] [59]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning, December 2023. URLhttp://arxiv.org/abs/2303.10512. arXiv:2303.10512 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[60] [60]

C-LoRA: Continual low-rank adaptation for pre-trained models.arXiv preprint arXiv:2502.17920, 2025

Xin Zhang, Liang Bai, Xian Yang, and Jiye Liang. C-LoRA: Continual low-rank adaptation for pre-trained models.arXiv preprint arXiv:2502.17920, 2025

work page arXiv 2025

[61] [61]

Decoupled Knowledge Distillation, July

Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled Knowledge Distillation, July

[62] [62]

undesirable

URLhttp://arxiv.org/abs/2203.08679. arXiv:2203.08679 [cs]. 13 Supplementary Material A Extended Related Work The main-text related work (§2) condenses the literature into three paragraphs. This appendix expands the discussion for readers who want a more complete picture of the LoRA, replay-free continual learning, and output-space distillation literatures...

work page arXiv 2024

[63] [63]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

2026