Learning When to Adapt
Pith reviewed 2026-05-20 12:00 UTC · model grok-4.3
The pith
DISeL uses input-dependent gates on LoRA components to activate adaptations only when they improve task performance, thereby reducing catastrophic forgetting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that making the low-rank updates input-sensitive through per-component gating lets fine-tuning preserve pre-trained behavior by default. The gates activate only selected rank-one components, adding a small number of parameters while keeping the efficiency of the low-rank structure.
What carries the argument
Lightweight input-dependent gates over individual rank-one components of LoRA modules, which learn to activate only when they reduce the fine-tuning loss.
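A minimal NumPy sketch of what such a gate could look like. The page does not state the gate's functional form, so the sigmoid of a linear map (`Wg`, `bg`) is an assumption, as are all names and shapes; LoRA's usual scaling factor is omitted for brevity. It is an illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_lora_forward(x, W0, A, B, Wg, bg):
    """Forward pass of one linear layer with per-component gates.

    x: (batch, d_in), W0: (d_out, d_in) frozen weight,
    A: (r, d_in), B: (d_out, r) LoRA factors,
    Wg: (r, d_in), bg: (r,) gate parameters (assumed form, not from the paper).
    Returns (output, gates).
    """
    base = x @ W0.T                  # frozen pre-trained path
    z = x @ A.T                      # (batch, r): one scalar per rank-one component
    gates = sigmoid(x @ Wg.T + bg)   # (batch, r): input-dependent gate per component
    delta = (gates * z) @ B.T        # gated low-rank update
    return base + delta, gates

rng = np.random.default_rng(0)
d_in, d_out, r, batch = 8, 6, 4, 3
x = rng.normal(size=(batch, d_in))
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
Wg = np.zeros((r, d_in))

# Strongly negative bias: every gate is close to 0, so the layer behaves like W0.
y_closed, g_closed = gated_lora_forward(x, W0, A, B, Wg, np.full(r, -20.0))
# Strongly positive bias: every gate is close to 1, so the layer behaves like plain LoRA.
y_open, g_open = gated_lora_forward(x, W0, A, B, Wg, np.full(r, 20.0))
```

With the gate bias pushed far negative the output matches the frozen layer, which is one way to read "preserve pre-trained behavior by default"; training would then have to move `Wg` and `bg` to open individual components.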
If this is right
- Models fine-tuned with DISeL show less forgetting on inputs outside the target distribution compared to static LoRA.
- The approach maintains competitive accuracy on fine-tuning tasks such as GLUE, mathematical reasoning, and code generation.
- Gate activation patterns reveal which layers and rank components concentrate the task-specific changes.
- Only a small number of additional parameters are required, preserving the parameter efficiency of LoRA.
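The diagnostic use of gate activations mentioned above can be sketched as a simple aggregation. This assumes gate values in [0, 1] have already been recorded per layer for a set of inputs and that every layer uses the same rank; the function names and the dictionary layout are hypothetical.

```python
import numpy as np

def gate_activation_map(gates_by_layer):
    """Average recorded gate values into a (n_layers, r) matrix.

    gates_by_layer: dict mapping layer name -> array of shape (n_inputs, r).
    Layers are kept in insertion order. Returns (layer_names, heat).
    """
    names = list(gates_by_layer)
    heat = np.stack([np.asarray(gates_by_layer[n]).mean(axis=0) for n in names])
    return names, heat

def most_active(names, heat, k=3):
    """Return the k (layer, rank_index, mean_activation) entries with the largest mean."""
    flat = np.argsort(heat, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, heat.shape)
    return [(names[i], int(j), float(heat[i, j])) for i, j in zip(rows, cols)]
```

The resulting matrix is the kind of layer-by-rank view a heatmap would plot; the ranking helper simply lists where the mean activation is highest.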
Where Pith is reading between the lines
- This selective activation mechanism might extend to other parameter-efficient methods to handle continual learning scenarios better.
- Analyzing the learned gates could inform which parts of large models are most plastic for specific task types.
- Deployed models could potentially use these gates for input-aware behavior without full retraining.
Load-bearing premise
The input-dependent gates can be optimized to selectively activate only those rank-one components that improve the fine-tuning objective without introducing instability or high computational overhead.
What would settle it
The claim would be undermined if experiments showed that DISeL does not reduce forgetting metrics relative to LoRA on the tested models and tasks, or if the gates failed to remain mostly inactive on out-of-distribution inputs.
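The two conditions above can be written as an explicit check. This is a hedged sketch: the page does not define the forgetting metric, so "mean score drop on held-out benchmarks" is an assumption, and the 0.5 cutoff for "mostly inactive" is an arbitrary placeholder rather than a value from the paper.

```python
def forgetting(base_scores, tuned_scores):
    """Mean drop from the base model to the fine-tuned model on shared held-out benchmarks.

    Both arguments map benchmark name -> score. Positive values mean forgetting.
    """
    common = sorted(base_scores.keys() & tuned_scores.keys())
    if not common:
        raise ValueError("no shared benchmarks to compare")
    return sum(base_scores[k] - tuned_scores[k] for k in common) / len(common)

def claim_survives(base, lora, disel, ood_gate_rate, max_ood_rate=0.5):
    """True only if both falsification conditions are avoided.

    ood_gate_rate: fraction of gate activations that are 'on' for out-of-distribution inputs.
    max_ood_rate: assumed threshold for 'mostly inactive' (placeholder).
    """
    less_forgetting = forgetting(base, disel) < forgetting(base, lora)
    gates_mostly_inactive = ood_gate_rate < max_ood_rate
    return less_forgetting and gates_mostly_inactive
```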
Original abstract
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method, yet its learned correction is static: the same low-rank update is applied to every input. This input-agnostic approach creates an inevitable compromise between adapting to the fine-tuning distribution and preserving pre-trained behavior on inputs outside that distribution, contributing to catastrophic forgetting. We introduce DISeL (Dynamic Input-Sensitive LoRA), which augments LoRA modules with lightweight input-dependent gates over individual rank-one components. The gating mechanism is designed to preserve the pre-trained model's behavior by default, while training learns to activate selected components that reduce the fine-tuning loss. DISeL adds only a small number of parameters and preserves the low-rank structure. Across RoBERTa on GLUE, and Llama and Mistral models fine-tuned for mathematical reasoning and code generation, DISeL reduces forgetting relative to LoRA and related variants while maintaining competitive fine-tuning accuracy. In addition, the learned gate activations provide an interpretable diagnostic view of which layers and rank components are most activated during fine-tuning, giving insight into where task-specific adaptation is concentrated. Code available at https://github.com/alizindari/DISeL.
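The abstract describes the update only in words. One way to write it down, with all notation assumed here rather than taken from the paper (b_i are the columns of B, a_i^T the rows of A, and g_i(x) in [0, 1] the gate on component i):

```latex
\begin{aligned}
\text{LoRA:}\quad
  h(x) &= W_0 x + B A x
        = W_0 x + \sum_{i=1}^{r} b_i \,\bigl(a_i^{\top} x\bigr), \\
\text{gated (assumed form):}\quad
  h(x) &= W_0 x + \sum_{i=1}^{r} g_i(x)\, b_i \,\bigl(a_i^{\top} x\bigr)
        = W_0 x + B \,\operatorname{diag}\bigl(g(x)\bigr)\, A x .
\end{aligned}
```

Under this reading, the input-dependent update B diag(g(x)) A has rank at most r for every x, which is consistent with the statement that the low-rank structure is preserved. Setting g(x) = 0 recovers the pre-trained layer and g(x) = 1 recovers ordinary LoRA.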
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DISeL, an augmentation to LoRA that introduces lightweight input-dependent gates over individual rank-one components. The gates are designed to default to preserving pre-trained behavior while learning to activate selected components only when they reduce fine-tuning loss. Empirical results are reported on RoBERTa fine-tuned on GLUE and on Llama/Mistral models for mathematical reasoning and code generation, claiming reduced forgetting relative to LoRA and related variants with competitive accuracy and added interpretability from gate activations.
Significance. If the central empirical claim holds after proper controls, the work would offer a practical route to input-sensitive parameter-efficient adaptation that mitigates catastrophic forgetting without substantial overhead, while the gate-activation diagnostics could provide useful insight into where task-specific updates concentrate in large models.
major comments (3)
- [§4] §4 (Experimental results): the comparisons to LoRA variants do not include an ablation with static (non-input-dependent) extra parameters or a fixed gating structure; without this control it remains unclear whether the reported forgetting reduction is driven by the input-sensitive mechanism or by the mere presence of additional trainable parameters.
- [§4] §4 and associated tables: no standard deviations, multiple random seeds, or statistical significance tests are reported for the forgetting and accuracy metrics across model families; this leaves the consistency of the gains difficult to assess.
- [§3] §3 (Method): the description of gate initialization and the training objective that enforces default preservation of pre-trained behavior lacks sufficient detail on hyper-parameters and regularization to evaluate whether the claimed stability is achieved by construction or by tuning.
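One point bears on the first major comment and can be checked numerically. If the gate is a fixed vector `s` that does not depend on the input, the gated update B diag(s) A equals an ordinary LoRA update with the columns of B rescaled, so a static-gate control adds parameters without enlarging the function class. The sketch below assumes the same per-component gating form as elsewhere on this page; names are hypothetical, and training dynamics could still differ between the two parameterizations.

```python
import numpy as np

def static_gate_update(A, B, s):
    """Weight update with a fixed, input-independent gate vector s of shape (r,)."""
    return B @ np.diag(s) @ A

def absorbed_lora_update(A, B, s):
    """The same update written as plain LoRA: column i of B is scaled by s[i]."""
    return (B * s) @ A

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 4
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
s = rng.uniform(size=r)

same = np.allclose(static_gate_update(A, B, s), absorbed_lora_update(A, B, s))
```

This makes the static-gate variant a natural control for separating parameter count from input dependence.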
minor comments (2)
- [Abstract] The abstract refers to 'related variants' without naming them; the experimental section should explicitly list the baselines (e.g., DoRA, VeRA) used for comparison.
- [Figures] Figure captions and axis labels for gate-activation heatmaps should include the exact layer indices and rank indices shown to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense of our work while agreeing to revisions that strengthen the empirical claims and clarity.
- Referee: [§4] §4 (Experimental results): the comparisons to LoRA variants do not include an ablation with static (non-input-dependent) extra parameters or a fixed gating structure; without this control it remains unclear whether the reported forgetting reduction is driven by the input-sensitive mechanism or by the mere presence of additional trainable parameters.
  Authors: We agree that this ablation would more cleanly isolate the contribution of input dependence. Our existing comparisons are to LoRA and related methods such as DoRA, but these do not hold the number of additional trainable parameters exactly fixed while removing input sensitivity. We will add the requested control (static extra rank-one updates without gates, or a fixed non-input-dependent gating structure) in the revised experiments to demonstrate that the forgetting reduction arises from the dynamic mechanism rather than parameter count alone.
  Revision: yes
- Referee: [§4] §4 and associated tables: no standard deviations, multiple random seeds, or statistical significance tests are reported for the forgetting and accuracy metrics across model families; this leaves the consistency of the gains difficult to assess.
  Authors: We concur that reporting variability is important for assessing reliability. In the revised manuscript we will rerun the key experiments across at least three random seeds, report means and standard deviations for both accuracy and forgetting metrics on RoBERTa/GLUE and the Llama/Mistral tasks, and include statistical significance tests (e.g., paired t-tests) to quantify the consistency of improvements across model families.
  Revision: yes
- Referee: [§3] §3 (Method): the description of gate initialization and the training objective that enforces default preservation of pre-trained behavior lacks sufficient detail on hyper-parameters and regularization to evaluate whether the claimed stability is achieved by construction or by tuning.
  Authors: We will expand §3 with the missing details. Specifically, we will describe the gate initialization (gates are initialized to strongly favor the identity/pre-trained state), the precise form of the auxiliary loss or regularization term that penalizes unnecessary activation on out-of-distribution inputs, and the full set of hyper-parameters (learning rates, regularization coefficients, temperature, etc.) used in all reported runs. This will make explicit how default preservation is encouraged both by architecture and by the objective.
  Revision: yes
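The multi-seed reporting promised in the second response could be computed along these lines. This is a standard-library sketch with hypothetical function names; it returns the paired t statistic and degrees of freedom and leaves the p-value lookup to a statistics package or table.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation of one method's per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t(a, b):
    """Paired t statistic for per-seed scores a and b (same seeds, same order).

    Returns (t, degrees_of_freedom). Raises ValueError for mismatched lengths,
    fewer than two pairs, or zero variance in the differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only three seeds there are two degrees of freedom, and the two-sided 5% critical value of the t distribution is about 4.30, so small differences between methods are unlikely to reach significance; more seeds would give the test more power.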
Circularity Check
No circularity: empirical augmentation with independent experimental validation
The paper proposes DISeL as a practical extension of LoRA by adding lightweight input-dependent gates over rank-one components. No derivation chain exists that reduces a claimed result to its own inputs by construction. The core claims rest on reported fine-tuning accuracy and forgetting metrics across RoBERTa/GLUE, Llama, and Mistral experiments, which are externally falsifiable against baselines. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The method is self-contained as an algorithmic design whose value is assessed by standard empirical comparison rather than internal redefinition.
Axiom & Free-Parameter Ledger
free parameters (1)
- gate parameters
axioms (1)
- Domain assumption: standard gradient-based optimization can jointly train the base LoRA weights and the new gates without instability.
invented entities (1)
- input-dependent gate over rank-one components (no independent evidence)