
arxiv: 2605.19028 · v1 · pith:JHTV7PO4new · submitted 2026-05-18 · 💻 cs.LG

Learning When to Adapt

Pith reviewed 2026-05-20 12:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords parameter-efficient fine-tuning · LoRA · dynamic adaptation · catastrophic forgetting · large language models · input-dependent gating

The pith

DISeL uses input-dependent gates on LoRA components to activate adaptations only when they improve task performance, thereby reducing catastrophic forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Low-rank adaptation methods like LoRA apply the same update to all inputs, which forces a trade-off between adapting to new data and keeping the original model behavior on other inputs. This static approach often leads to forgetting of pre-trained capabilities. DISeL adds lightweight gates that control each rank-one update based on the current input, learning to turn them on only when they help the fine-tuning objective while defaulting to the pre-trained state otherwise. Tests across language understanding benchmarks and reasoning tasks with various models show reduced forgetting compared to LoRA variants at similar accuracy levels. The gates also offer a way to see which layers adapt most for a given task.

Core claim

The central discovery is that making the low-rank updates input-sensitive through per-component gating allows the fine-tuning process to preserve pre-trained behavior by default. The gates activate individual rank-one components selectively, adding only a small number of parameters while maintaining the efficiency of the low-rank structure.

What carries the argument

Lightweight input-dependent gates over individual rank-one components of LoRA modules, which learn to activate only when they reduce the fine-tuning loss.
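
The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each gate is a sigmoid of a linear function of the input with a negative bias (so gates start near closed), which this page does not confirm. The names `gated_lora_forward`, `gate_w`, and `gate_b` are hypothetical, and the toy numbers are chosen by hand.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, x):
    return [dot(row, x) for row in M]

def gated_lora_forward(W, A, B, gate_w, gate_b, x):
    """Compute y = W x + sum_i g_i(x) * (a_i . x) * b_i.

    W       frozen pre-trained weight, d_out rows of length d_in
    A       r vectors a_i of length d_in  (LoRA "down" directions)
    B       r vectors b_i of length d_out (LoRA "up" directions)
    gate_w  r vectors of length d_in (ASSUMED linear gate weights)
    gate_b  r scalars (ASSUMED gate biases; negative means closed by default)
    """
    y = matvec(W, x)                      # pre-trained output
    gates = []
    for a_i, b_i, w_i, c_i in zip(A, B, gate_w, gate_b):
        g = sigmoid(dot(w_i, x) + c_i)    # input-dependent gate in (0, 1)
        z = dot(a_i, x)                   # rank-one projection of the input
        for k in range(len(y)):
            y[k] += g * z * b_i[k]        # gated rank-one correction
        gates.append(g)
    return y, gates

# Toy setup: identity base weight, one rank-one component, gate keyed on x[0].
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5, -0.5]]
gate_w = [[4.0, 0.0]]
gate_b = [-6.0]

x_pre = [0.0, 1.0]  # stands in for a "pre-training-like" input: gate stays near 0
x_ft = [3.0, 1.0]   # stands in for a "fine-tuning-like" input: gate opens

y_pre, g_pre = gated_lora_forward(W, A, B, gate_w, gate_b, x_pre)
y_ft, g_ft = gated_lora_forward(W, A, B, gate_w, gate_b, x_ft)
print("gate on x_pre:", g_pre[0], "output:", y_pre)
print("gate on x_ft :", g_ft[0], "output:", y_ft)
```

With a closed gate the output stays close to the frozen model's `W x`, which is the "preserve by default" behavior the page describes; a static LoRA update would apply the same correction to both inputs.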

If this is right

  • Models fine-tuned with DISeL show less forgetting on inputs outside the target distribution than models fine-tuned with static LoRA.
  • The approach maintains competitive accuracy on fine-tuning tasks such as GLUE, mathematical reasoning, and code generation.
  • Gate activation patterns reveal which layers and rank components concentrate the task-specific changes.
  • Only a small number of additional parameters are required, preserving the parameter efficiency of LoRA.
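
To give the parameter-efficiency point a sense of scale, the following sketch counts parameters for a single layer. The gate parameterization (one weight vector and one bias per rank-one component) is an assumption made for illustration; the page does not state the actual gate form or count, and the layer size is illustrative rather than taken from the paper.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters of a plain rank-r LoRA update (r down plus r up vectors)."""
    return r * (d_in + d_out)

def linear_gate_param_count(d_in, r):
    """HYPOTHETICAL gate: one weight vector of length d_in plus one bias per component."""
    return r * (d_in + 1)

d_in = d_out = 4096  # illustrative layer size
r = 32               # illustrative rank

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
gates = linear_gate_param_count(d_in, r)

print(f"full weight matrix : {full:,}")
print(f"LoRA (rank {r})     : {lora:,} ({lora / full:.2%} of full)")
print(f"hypothetical gates : {gates:,} ({gates / full:.2%} of full)")
```

Under these assumptions the gates stay well below one percent of the full matrix; a cheaper gate (for example, one that reads the r-dimensional projection instead of the raw input) would shrink the overhead further.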

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This selective activation mechanism might extend to other parameter-efficient methods to handle continual learning scenarios better.
  • Analyzing the learned gates could inform which parts of large models are most plastic for specific task types.
  • Deployed models could potentially use these gates for input-aware behavior without full retraining.

Load-bearing premise

The input-dependent gates can be optimized to selectively activate only those rank-one components that improve the fine-tuning objective without introducing instability or high computational overhead.

What would settle it

The claim would be undermined if experiments showed that DISeL does not reduce forgetting metrics relative to LoRA on the tested models and tasks, or if the gates failed to remain mostly inactive on out-of-distribution inputs.
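
One way to operationalize the second condition is to compare how often gates are open on in-domain versus out-of-domain inputs. The sketch below is an assumption-laden illustration: the helper `open_fraction`, the 0.5 threshold, and the gate values are all made up for the example and are not results from the paper.

```python
def open_fraction(gate_values, threshold=0.5):
    """Fraction of gate activations above `threshold`.

    gate_values: list of samples, each a list of gate activations in [0, 1].
    """
    flat = [g for sample in gate_values for g in sample]
    return sum(g > threshold for g in flat) / len(flat)

# Made-up gate activations (2 samples x 3 gates each), for illustration only.
in_domain = [[0.90, 0.80, 0.10], [0.95, 0.60, 0.20]]
out_domain = [[0.05, 0.10, 0.02], [0.10, 0.60, 0.03]]

frac_in = open_fraction(in_domain)
frac_out = open_fraction(out_domain)
print(f"open on in-domain inputs : {frac_in:.2f}")
print(f"open on out-domain inputs: {frac_out:.2f}")
```

If the out-of-domain fraction were not clearly lower than the in-domain one on real gate activations, that would count against the default-preservation claim.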

Figures

Figures reproduced from arXiv: 2605.19028 by Ali Zindari, Rotem Mulayoff, Sebastian U. Stich, Xiaowen Jiang.

Figure 1
Selective adaptation toy example. We minimize the loss in (3) on inputs drawn from a symmetric mixture of two Gaussian populations (see Appendix I for details). Panel (a) shows the test MSE on fine-tuning and pre-training inputs. Here, LoRA and full FT reach the fixed-correction tradeoff, whereas DISeL achieves much lower error on both domains, approaching the Bayes-optimal error floor (dashed line). Panel…
Figure 2
RoBERTa on GLUE. We fine-tune RoBERTa-base on five GLUE tasks: MNLI, SST-2, QNLI, CoLA, and MRPC, using various methods and ranks. The vertical axis measures the average test accuracy for the corresponding tasks, while the horizontal axis indicates the average masked-LM perplexity, evaluated on BookCorpusOpen, CC-News, and WikiText-2. Each point represents a method-rank pair averaged across three seeds, wi…
Figure 3
Llama and Mistral instruction fine-tuning. We fine-tune Llama and Mistral on math and code instructions, respectively, with various methods using different ranks. Panel (a) shows the results for Llama, whereas Panel (b) depicts them for Mistral. Here, the vertical axis shows test accuracy for the corresponding fine-tuning task, and the horizontal axis shows the average accuracy across several pre-training …
Figure 4
Gate activation histograms across domains. We analyze gate activations of a rank-32 Llama math adapter on held-out samples from math, general text, and code. Each panel corresponds to a layer-depth band: early, middle, or late layers. Here, the gates are substantially more open for math inputs, the fine-tuning domain, than for general text from the pre-training distribution. The largest differences appear …
Figure 5
Layer-wise gate usage within key modules. We processed held-out math samples with Llama using a rank-32 DISeL adapter and measured the output distributions of each module’s gates. This figure presents these distributions for the attention value projection (v_proj), the MLP up projection (up_proj), and the MLP down projection (down_proj). Each panel shows normalized histograms for early, middle, and late layer…
Figure 6
Retention vs. training step. We evaluate the retention performance during fine-tuning from the experiment in Sec. 4.2. Here we present the results for the Llama 2 7B model fine-tuned on MetaMathQA using LoRA and DISeL with various ranks. Panel (a) shows the results for DISeL, while Panel (b) plots them for LoRA. We see that DISeL maintains near-zero retention drop at every rank, including the largest, whereas …
Figure 7
Fine-tuning accuracy over training checkpoints for LoRA and DISeL on Llama 2 7B / MetaMathQA. Each curve reports GSM8K accuracy at every saved checkpoint, for adapter ranks r ∈ {16, 32, 64, 128, 256}.
[Plot residue from an adjacent retention figure: x-axis: checkpoint step; y-axis: retention accuracy (%); curves for r = 16, 32, 64, 128, 256 and the base model. Panel (a), DISeL: retention stays close to the base model across ranks an…]
Figure 8
Retention over training checkpoints for LoRA and DISeL on Mistral 7B / Magicoder. Each curve tracks the unweighted mean accuracy across the 14-benchmark retention suite at every saved checkpoint, for adapter ranks r ∈ {16, 32, 64, 128, 256}.
[Spilled appendix text: IV Checkpoint Dynamics: Retention and Fine-Tuning Accuracy. IV.1 Llama 2 7B / MetaMathQA. Section 6 reports the checkpoint-level retention curves for the Llama 2 7B / M…]
Figure 9
Fine-tuning accuracy over training checkpoints for LoRA and DISeL on Mistral 7B / Magicoder. Each curve reports HumanEval pass@1 at every saved checkpoint, for adapter ranks r ∈ {16, 32, 64, 128, 256}.
[Spilled appendix text: V Interpretability: Additional Results. This appendix complements the gate-activation analysis of Section 5, which focused on Llama 2 7B at rank r=32. We report (i) additional Llama 2 7B results across ranks …]
Figure 10
Gate distributions by module family and depth on Llama math inputs. We aggregate gate values from the rank-32 Llama math adapter over held-out GSM8K examples. Attention denotes q_proj, k_proj, v_proj, and o_proj; MLP denotes gate_proj, up_proj, and down_proj. Gates open most strongly in the middle and late layers, especially in the MLP stack.
[Plot residue: x-axis: gate value (0 to 1); y-axis: density; first panel: q_proj, Ea…]
Figure 11
Gate distributions by projection on Llama math inputs. We show gate-value histograms for all adapted projections of the rank-32 Llama math adapter, grouped by layer-depth band. The MLP gate_proj and up_proj are the most active, while v_proj and early down_proj remain close to closed.
[Spilled body text: … that the code adapter treats math and code as closely related structured-symbolic inputs, while still opening its highest-c…]
Figure 12
Gate activation histograms across domains for the Mistral code adapter. We analyze the rank-32 Mistral adapter fine-tuned on Magicoder. Unlike the Llama math adapter, where math is clearly separated from code, the Mistral code adapter opens gates to a similar degree on held-out code and math inputs, while general text remains lower. Code nevertheless has the largest concentration of fully open gates, even…
Figure 13
Gate distributions by module family and depth on Mistral code inputs. Attention denotes q_proj, k_proj, v_proj, and o_proj; MLP denotes gate_proj, up_proj, and down_proj. As in the Llama analysis, gates open most strongly in the MLP stack and in middle-to-late layers.
Figure 14
Gate distributions by projection on Mistral code inputs. We show gate-value histograms for all adapted projections of the rank-32 Mistral code adapter, grouped by layer-depth band. The largest activations occur in the MLP gate_proj and up_proj, while several attention projections remain mostly closed until late layers.
Original abstract

Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method, yet its learned correction is static: the same low-rank update is applied to every input. This input-agnostic approach creates an inevitable compromise between adapting to the fine-tuning distribution and preserving pre-trained behavior on inputs outside that distribution, contributing to catastrophic forgetting. We introduce DISeL (Dynamic Input-Sensitive LoRA), which augments LoRA modules with lightweight input-dependent gates over individual rank-one components. The gating mechanism is designed to preserve the pre-trained model's behavior by default, while training learns to activate selected components that reduce the fine-tuning loss. DISeL adds only a small number of parameters and preserves the low-rank structure. Across RoBERTa on GLUE, and Llama and Mistral models fine-tuned for mathematical reasoning and code generation, DISeL reduces forgetting relative to LoRA and related variants while maintaining competitive fine-tuning accuracy. In addition, the learned gate activations provide an interpretable diagnostic view of which layers and rank components are most activated during fine-tuning, giving insight into where task-specific adaptation is concentrated. Code available at https://github.com/alizindari/DISeL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DISeL, an augmentation to LoRA that introduces lightweight input-dependent gates over individual rank-one components. The gates are designed to default to preserving pre-trained behavior while learning to activate selected components only when they reduce fine-tuning loss. Empirical results are reported on RoBERTa fine-tuned on GLUE and on Llama/Mistral models for mathematical reasoning and code generation, claiming reduced forgetting relative to LoRA and related variants with competitive accuracy and added interpretability from gate activations.

Significance. If the central empirical claim holds after proper controls, the work would offer a practical route to input-sensitive parameter-efficient adaptation that mitigates catastrophic forgetting without substantial overhead, while the gate-activation diagnostics could provide useful insight into where task-specific updates concentrate in large models.

major comments (3)
  1. [§4] §4 (Experimental results): the comparisons to LoRA variants do not include an ablation with static (non-input-dependent) extra parameters or a fixed gating structure; without this control it remains unclear whether the reported forgetting reduction is driven by the input-sensitive mechanism or by the mere presence of additional trainable parameters.
  2. [§4] §4 and associated tables: no standard deviations, multiple random seeds, or statistical significance tests are reported for the forgetting and accuracy metrics across model families; this leaves the consistency of the gains difficult to assess.
  3. [§3] §3 (Method): the description of gate initialization and the training objective that enforces default preservation of pre-trained behavior lacks sufficient detail on hyper-parameters and regularization to evaluate whether the claimed stability is achieved by construction or by tuning.
minor comments (2)
  1. [Abstract] The abstract refers to 'related variants' without naming them; the experimental section should explicitly list the baselines (e.g., DoRA, VeRA) used for comparison.
  2. [Figures] Figure captions and axis labels for gate-activation heatmaps should include the exact layer indices and rank indices shown to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense of our work while agreeing to revisions that strengthen the empirical claims and clarity.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental results): the comparisons to LoRA variants do not include an ablation with static (non-input-dependent) extra parameters or a fixed gating structure; without this control it remains unclear whether the reported forgetting reduction is driven by the input-sensitive mechanism or by the mere presence of additional trainable parameters.

    Authors: We agree that this ablation would more cleanly isolate the contribution of input dependence. Our existing comparisons are to LoRA and related methods such as DoRA, but these do not hold the number of additional trainable parameters exactly fixed while removing input sensitivity. We will add the requested control (static extra rank-one updates without gates, or a fixed non-input-dependent gating structure) in the revised experiments to demonstrate that the forgetting reduction arises from the dynamic mechanism rather than parameter count alone. revision: yes

  2. Referee: [§4] §4 and associated tables: no standard deviations, multiple random seeds, or statistical significance tests are reported for the forgetting and accuracy metrics across model families; this leaves the consistency of the gains difficult to assess.

    Authors: We concur that reporting variability is important for assessing reliability. In the revised manuscript we will rerun the key experiments across at least three random seeds, report means and standard deviations for both accuracy and forgetting metrics on RoBERTa/GLUE and the Llama/Mistral tasks, and include statistical significance tests (e.g., paired t-tests) to quantify the consistency of improvements across model families. revision: yes

  3. Referee: [§3] §3 (Method): the description of gate initialization and the training objective that enforces default preservation of pre-trained behavior lacks sufficient detail on hyper-parameters and regularization to evaluate whether the claimed stability is achieved by construction or by tuning.

    Authors: We will expand §3 with the missing details. Specifically, we will describe the gate initialization (gates are initialized to strongly favor the identity/pre-trained state), the precise form of the auxiliary loss or regularization term that penalizes unnecessary activation on out-of-distribution inputs, and the full set of hyper-parameters (learning rates, regularization coefficients, temperature, etc.) used in all reported runs. This will make explicit how default preservation is encouraged both by architecture and by the objective. revision: yes
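
The seed-level reporting promised in response 2 could look roughly like the sketch below. It is an illustration under stated assumptions: the per-seed scores are placeholders invented for the example, not results from the paper, and `paired_t_statistic` is a hypothetical helper. It returns only the t statistic and degrees of freedom; a p-value would come from a t-distribution table or a statistics library such as SciPy.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b.

    a[i] and b[i] must come from the same seed (or the same task/seed pair).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Placeholder retention accuracies (%) for three seeds; NOT real results.
disel = [66.0, 67.0, 68.0]
lora = [63.0, 65.0, 64.0]

t_stat, dof = paired_t_statistic(disel, lora)
print(f"DISeL: {statistics.mean(disel):.2f} ± {statistics.stdev(disel):.2f}")
print(f"LoRA : {statistics.mean(lora):.2f} ± {statistics.stdev(lora):.2f}")
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

With only three seeds the test has two degrees of freedom, so even large t values give limited evidence; reporting the per-seed differences alongside the statistic keeps that visible.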

Circularity Check

0 steps flagged

No circularity: empirical augmentation with independent experimental validation

full rationale

The paper proposes DISeL as a practical extension of LoRA by adding lightweight input-dependent gates over rank-one components. No derivation chain exists that reduces a claimed result to its own inputs by construction. The core claims rest on reported fine-tuning accuracy and forgetting metrics across RoBERTa/GLUE, Llama, and Mistral experiments, which are externally falsifiable against baselines. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The method is self-contained as an algorithmic design whose value is assessed by standard empirical comparison rather than internal redefinition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that a small gating network can be optimized jointly with the low-rank updates without introducing new failure modes. No new physical entities or mathematical axioms beyond standard neural-network training assumptions are introduced.

free parameters (1)
  • gate parameters
    Lightweight parameters for the input-dependent gates added to each rank-one component; their count is described as small but not quantified in the abstract.
axioms (1)
  • domain assumption: Standard gradient-based optimization can jointly train the base LoRA weights and the new gates without instability.
    Implicit in the statement that training learns to activate selected components.
invented entities (1)
  • input-dependent gate over rank-one components (no independent evidence)
    purpose: To decide per-input whether to apply each low-rank update or leave the pre-trained weights unchanged.
    New mechanism introduced to make adaptation input-sensitive while preserving low-rank structure.

pith-pipeline@v0.9.0 · 5743 in / 1327 out tokens · 43569 ms · 2026-05-20T12:00:58.365759+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 5 internal anchors

  1. [1]

    documentation debt

    Jack Bandy and Nicholas Vincent. Addressing "documentation debt" in machine learning research: A retrospective datasheet for bookcorpus.arXiv preprint arXiv:2105.05241, 2021

  2. [2]

    LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024

  3. [3]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. InThirty-Fourth AAAI Conference on Artificial Intelligence, 2020

  4. [4]

    Eric L Buehler and Markus J Buehler. X-LoRA: Mixture of low-rank adapter experts, a flexible framework for large language models with applications in protein mechanics and molecular design.APL Machine Learning, 2(2), 2024

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  6. [6]

    Learning rate scaling across LoRA ranks and transfer to full finetuning.arXiv preprint arXiv:2602.06204, 2026

    Nan Chen, Soledad Villar, and Soufiane Hayou. Learning rate scaling across LoRA ranks and transfer to full finetuning.arXiv preprint arXiv:2602.06204, 2026

  7. [7]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers)....

  8. [8]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv:1803.05457, 2018

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 10

  10. [10]

    QLoRA: Efficient finetuning of quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2023

  11. [11]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers).Association for Computationa...

  12. [12]

    Automatically constructing a corpus of sentential para- phrases

    William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential para- phrases. InProceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005

  13. [13]

    How abilities in large language models are affected by supervised fine-tuning data composition

    Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 177–198, 2024

  14. [14]

    AuroRA: Breaking low-rank bottleneck of LoRA with nonlinear mapping

    Haonan Dong, Wenhao Zhu, Guojie Song, and Liang Wang. AuroRA: Breaking low-rank bottleneck of LoRA with nonlinear mapping. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2025

  15. [15]

    Gated LoRA: Dual-purpose projections for parameter-efficient mini-expert fine-tuning

    SooHwan Eom, Hee Suk Yoon, Eunseop Yoon, Mark A Hasegawa-Johnson, and Chang D Yoon. Gated LoRA: Dual-purpose projections for parameter-efficient mini-expert fine-tuning. InSubmitted to International Conference on Learning Representations, 2025

  16. [16]

    LoRA+: Efficient low rank adaptation of large models

    Soufiane Hayou, Nikhil Ghosh, and Bin Yu. LoRA+: Efficient low rank adaptation of large models. InProceedings of the 41st International Conference on Machine Learning. PMLR, 2024

  17. [17]

    Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. InProceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015

  18. [18]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InThe Ninth Interna- tional Conference on Learning Representations, 2021

  19. [19]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, 2022

  20. [20]

    LoraHub: Efficient cross-task generalization via dynamic LoRA composition

    Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. InFirst Conference on Language Modeling, 2024

  21. [21]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023

  22. [22]

    Xiaowen Jiang, Xun Wang, and Sebastian U. Stich. LoRAM: Low-rank adaptation of large language models on manifold. InSparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference, 2025

  23. [23]

    What disease does this patient have? A large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams.arXiv preprint arXiv:2009.13081, 2020

  24. [24]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601–1611, 2017. 11

  25. [25]

    VeRA: Vector-based random matrix adaptation

    Dawid Kopiczko, Tijmen Blankevoort, and Yuki Asano. VeRA: Vector-based random matrix adaptation. InThe Twelfth International Conference on Learning Representations, 2024

  26. [26]

    Mixture of experts meets prompt-based continual learning

    Minh Le, An Nguyen, Huy Nguyen, Trang Nguyen, Trang Pham, Linh Van Ngo, and Nhat Ho. Mixture of experts meets prompt-based continual learning. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2024

  27. [27]

    Gated integration of low-rank adaptation for continual learning of large language models

    Yan-Shuo Liang, Jia-Rui Chen, and Wu-Jun Li. Gated integration of low-rank adaptation for continual learning of large language models. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2025

  28. [28]

    DoRA: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In Proceedings of the 41st International Conference on Machine Learning. PMLR, 2024

  29. [29]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach.arXiv preprint arXiv:1907.11692, 2019

  30. [30]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

  31. [31]

    WizardCoder: Empowering code large language models with Evol-Instruct

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering code large language models with Evol-Instruct. InThe Twelfth International Conference on Learning Representations, 2024

  32. [32]

    CC-News-En: A large English news corpus

    Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R Trippas, J Shane Culpepper, and Alistair Moffat. CC-News-En: A large English news corpus. InProceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020

  33. [33]

    RanPAC: Random projections and pre-trained models for continual learning

    Mark D McDonnell, Dong Gong, Amin Parvaneh, Ehsan Abbasnejad, and Anton Van den Hengel. RanPAC: Random projections and pre-trained models for continual learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2023

  34. [34]

    PiSSA: Principal singular values and singular vectors adaptation of large language models

    Fanxu Meng, Zhaohui Wang, and Muhan Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2024

  35. [35]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. InThe Fifth International Conference on Learning Representations, 2017

  36. [36]

    Can a suit of armor conduct electricity? A new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

  37. [37]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2022

  38. [38]

    Language models are unsupervised multitask learners, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019

  39. [39]

    SQuAD: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016

  40. [40]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

  41. [41]

    WinoGrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 2021. 12

  42. [42]

    LoRA vs full fine-tuning: An illusion of equivalence

    Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2025

  43. [43]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013

  44. [44]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers). Association for Computationa...

  45. [45]

    HydraLoRA: An asymmetric LoRA architecture for efficient fine-tuning

    Chunlin Tian, Zhan Shi, Zhijiang Guo, Li Li, and Chengzhong Xu. HydraLoRA: An asymmetric LoRA architecture for efficient fine-tuning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2024

  46. [46]

    Llama 2: Open foundation and fine-tuned chat models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  47. [47]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pages 353–355, 2018

  48. [48]

    Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality

    Liyuan Wang, Jingyi Xie, Xingxing Zhang, Mingyi Huang, Hang Su, and Jun Zhu. Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2023

  49. [49]

    A comprehensive survey of continual learning: Theory, method and application

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

  50. [50]

    LoRA-GA: Low-rank adaptation with gradient approximation

    Shaowen Wang, Linxi Yu, and Jian Li. LoRA-GA: Low-rank adaptation with gradient approximation. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2024

  51. [51]

    S-prompts learning with pre-trained transformers: An Occam’s razor for domain incremental learning

    Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An Occam’s razor for domain incremental learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2022

  52. [52]

    DualPrompt: Complementary prompting for rehearsal-free continual learning

    Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pages 631–648. Springer, 2022

  53. [53]

    Neural network acceptability judgments

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019

  54. [54]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, 2022

  55. [55]

    Magicoder: Empowering code generation with OSS-Instruct

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with OSS-Instruct. In Proceedings of the 41st International Conference on Machine Learning. PMLR, 2024

  56. [56]

    Crowdsourcing multiple choice science questions

    Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, 2017

  57. [57]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2018

  58. [58]

    Mixture of LoRA experts

    Xun Wu, Shaohan Huang, and Furu Wei. Mixture of LoRA experts. In The Twelfth International Conference on Learning Representations, 2024

  59. [59]

    Metamath: Bootstrap your own mathematical questions for large language models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2024

  60. [60]

    Mammoth: Building math generalist models through hybrid instruction tuning

    Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. In The Twelfth International Conference on Learning Representations, 2024

  61. [61]

    Hellaswag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

  62. [62]

    AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning

    Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023

Appendix I Complete details of the motivating example

This appendix gives the full d...