Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
Pith reviewed 2026-05-17 01:44 UTC · model grok-4.3
The pith
By identifying and freezing critical parameters from source data, LLMs adapt to new languages using only unlabeled target text while losing under 4 percent of original performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that scoring parameter importance on a small source dataset, then freezing the highest-scoring parameters column-wise before fine-tuning on unlabeled target data, lets LLMs retain their source abilities. Average source degradation is 3.4 percent for 7B models and 2.8 percent for 13B models, versus 20.3 percent and 22.3 percent under full fine-tuning, while target performance matches or exceeds full fine-tuning on most benchmarks.
What carries the argument
Source-Shielded Updates (SSU), which scores parameters for source importance on limited source data and then freezes critical columns before target adaptation.
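As a rough illustration of this mechanism, the sketch below scores weights with the formula quoted in the paper passage further down this page, s_ij = |θ_ij| · ||X_j||_2, ranks columns, and skips updates to the top-ranked ones. Several details are assumptions made for the sketch rather than the paper's procedure: summing scores over rows to get one score per column, treating column j as input feature j, using a plain SGD step, and the hypothetical function names `column_scores`, `frozen_columns`, and `masked_sgd_step`. Pure-Python lists stand in for real tensors.

```python
import math

def column_scores(theta, acts):
    """Sum s_ij = |theta_ij| * ||X_j||_2 over rows i, giving one score per column j.

    theta: weight matrix as a list of rows (out_features x in_features).
    acts:  source activations as a list of samples (n_samples x in_features).
    Summing over rows is an assumption made for this sketch.
    """
    n_cols = len(theta[0])
    # L2 norm of each input feature j across the source samples.
    norms = [math.sqrt(sum(x[j] ** 2 for x in acts)) for j in range(n_cols)]
    return [norms[j] * sum(abs(row[j]) for row in theta) for j in range(n_cols)]

def frozen_columns(scores, fraction=0.5):
    """Return the indices of the top `fraction` of columns by score."""
    k = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])

def masked_sgd_step(theta, grads, frozen, lr=0.1):
    """Plain SGD update that leaves frozen columns untouched."""
    return [
        [w if j in frozen else w - lr * g
         for j, (w, g) in enumerate(zip(row, grad_row))]
        for row, grad_row in zip(theta, grads)
    ]

# Toy values for illustration only (not taken from the paper).
theta = [[0.5, -2.0, 0.1, 0.5],
         [1.5, 0.5, -0.2, -0.5]]
source_acts = [[1.0, 2.0, 0.0, 1.0],
               [1.0, 0.0, 3.0, 1.0]]
target_grads = [[1.0, 1.0, 1.0, 1.0],
                [1.0, 1.0, 1.0, 1.0]]

scores = column_scores(theta, source_acts)
frozen = frozen_columns(scores, fraction=0.5)
updated = masked_sgd_step(theta, target_grads, frozen)
```

In this toy case columns 0 and 1 receive the highest scores and stay fixed, while columns 2 and 3 absorb the target-language gradient. Freezing whole columns rather than scattered individual weights keeps the mask structured, which is presumably part of the appeal of the column-wise choice.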
If this is right
- SSU enables adaptation to new languages using only unlabeled target data without labeled supervision.
- Target-language performance equals or surpasses full fine-tuning on all 7B benchmarks and most 13B benchmarks.
- The method maintains effectiveness across five typologically diverse languages.
- Source performance drops remain below 4 percent on average for both 7B and 13B models.
Where Pith is reading between the lines
- The same selective-freezing idea could apply to domain adaptation or other continual-learning problems beyond language expansion.
- Refining the importance-scoring step might shrink the remaining degradation (3.4 percent for 7B, 2.8 percent for 13B) further.
- Scaling the approach to models larger than 13B would test whether the protection effect holds at greater sizes.
Load-bearing premise
The parameter importance scoring method applied to a small set of source data reliably identifies the parameters most critical to preserving source abilities across the full range of downstream tasks.
What would settle it
Applying SSU to a new set of target languages or tasks and measuring average source-task degradation above 10 percent would challenge the central claim.
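The falsification test above can be sketched as a small check. This assumes degradation is measured as the per-task relative drop from the base model, averaged over tasks; the paper's exact aggregation is not given on this page, and the scores below are placeholders rather than reported results. The names `relative_degradation` and `mean_degradation` are hypothetical.

```python
def relative_degradation(base, adapted):
    """Percent drop of `adapted` relative to `base` for one task."""
    return 100.0 * (base - adapted) / base

def mean_degradation(base_scores, adapted_scores):
    """Average per-task relative degradation over tasks present in both dicts."""
    tasks = sorted(base_scores.keys() & adapted_scores.keys())
    drops = [relative_degradation(base_scores[t], adapted_scores[t]) for t in tasks]
    return sum(drops) / len(drops)

# Placeholder scores for illustration only; these are not results from the paper.
base = {"task_a": 50.0, "task_b": 80.0}
adapted = {"task_a": 48.0, "task_b": 78.0}

avg = mean_degradation(base, adapted)
challenges_claim = avg > 10.0  # 10 percent threshold taken from the text above
```

With these placeholder numbers the average drop is 3.25 percent, so the check would not flag a challenge to the claim.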
Original abstract
Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Source-Shielded Updates (SSU) to adapt instruct LLMs to target languages using only unlabeled target data while mitigating catastrophic forgetting of source abilities. SSU computes parameter importance scores from a small source dataset and applies column-wise freezing of critical parameters before performing adaptation. Experiments on 7B and 13B models across five typologically diverse languages report that SSU limits average degradation on monolingual source tasks to 3.4% (7B) and 2.8% (13B), versus 20.3% and 22.3% for full fine-tuning, while achieving target-language performance that is competitive with or superior to full fine-tuning.
Significance. If the empirical results hold under broader validation, SSU offers a practical, low-resource technique for expanding LLM linguistic coverage without requiring labeled target data or suffering severe source forgetting. The selective freezing approach is a clear strength and could transfer to other continual-learning or domain-adaptation settings. The manuscript earns credit for consistent quantitative comparisons across model sizes and languages with explicit baseline contrasts.
major comments (2)
- [§3] §3 (Method): The parameter-importance scoring procedure is central to the headline claim yet relies on a small source corpus whose selection and coverage are not fully detailed. The skeptic concern is load-bearing here: if the scoring (gradient magnitude, Fisher, or similar) is dominated by the particular examples chosen, parameters critical to untested source behaviors (reasoning chains, long-context coherence) may remain unfrozen and overwritten. The reported source-task results may overlap with the scoring distribution, so they do not fully test generalization.
- [Experiments] Experiments section / Table reporting source degradation: Average degradations of 3.4% / 2.8% are presented without per-run variance, statistical significance tests, or results on held-out source tasks disjoint from the importance-scoring data. This weakens confidence that the protection generalizes beyond the scoring distribution, directly affecting the central claim of reliable source preservation.
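One inexpensive way to support the disjointness point is an exact-overlap check between the scoring examples and the evaluation examples. The sketch below is only that: it normalizes case and whitespace and compares hashes, so it catches verbatim reuse but says nothing about distributional overlap, which is the deeper concern raised in the first comment. The function names and sample strings are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Hash of case- and whitespace-normalized text, used to detect exact overlap."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def overlapping_examples(scoring_texts, eval_texts):
    """Return evaluation texts whose normalized form also appears in the scoring set."""
    seen = {fingerprint(t) for t in scoring_texts}
    return [t for t in eval_texts if fingerprint(t) in seen]

# Hypothetical strings standing in for real scoring and benchmark examples.
scoring_set = ["Translate the sentence.", "Summarize  this paragraph."]
eval_set = ["summarize this paragraph.", "Solve the equation."]

leaks = overlapping_examples(scoring_set, eval_set)
```

An empty `leaks` list would be the outcome to report alongside the source-degradation table.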
minor comments (2)
- [Abstract] The abstract states results for 'five typologically diverse languages' but does not name them; adding the language list would improve immediate readability.
- [§4] Hyperparameter choices for the importance threshold or fraction of parameters frozen are mentioned as free parameters but lack explicit values or sensitivity analysis in the main text; moving these to a dedicated paragraph or appendix would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we plan to incorporate.
- Referee: [§3] §3 (Method): The parameter-importance scoring procedure is central to the headline claim yet relies on a small source corpus whose selection and coverage are not fully detailed. The skeptic concern is load-bearing here: if the scoring (gradient magnitude, Fisher, or similar) is dominated by the particular examples chosen, parameters critical to untested source behaviors (reasoning chains, long-context coherence) may remain unfrozen and overwritten. The reported source-task results may overlap with the scoring distribution, so they do not fully test generalization.
Authors: We appreciate the referee's emphasis on the importance of detailing the source corpus for parameter scoring. In the revised manuscript, we will expand Section 3 with additional specifics on the source data: its approximate size, selection criteria (sampling from diverse instruction-following and reasoning examples to cover core source abilities), and the exact scoring method (weight magnitude scaled by the input-activation norm, $s_{ij} = |\theta_{ij}| \cdot \|X_j\|_2$). We will also clarify that the small scoring set targets general parameter importance rather than being tied to specific evaluation examples, and add a limitations discussion acknowledging that while our source-task benchmarks test a range of behaviors including reasoning and coherence, exhaustive coverage of all possible source capabilities remains an inherent challenge in this low-resource setup. revision: yes
- Referee: Experiments section / Table reporting source degradation: Average degradations of 3.4% / 2.8% are presented without per-run variance, statistical significance tests, or results on held-out source tasks disjoint from the importance-scoring data. This weakens confidence that the protection generalizes beyond the scoring distribution, directly affecting the central claim of reliable source preservation.
Authors: We agree that reporting variance and statistical tests would increase confidence in the results. We will revise the experiments section and tables to include per-run standard deviations and appropriate statistical significance tests (e.g., paired t-tests) comparing SSU against full fine-tuning. Regarding held-out source tasks, the evaluation benchmarks used are standard monolingual tasks that are disjoint from the minimal scoring examples; we will explicitly state this separation in the revised text and add a brief discussion of generalization. If further disjoint evaluations prove necessary, we will note this as future work. revision: partial
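The promised statistics could look roughly like the following, which computes a per-method standard deviation and a paired t statistic over matched runs using only the standard library. The run scores are placeholders for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper. Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder per-seed source scores, for illustration only (not paper results).
ssu_runs = [70.0, 72.0, 71.0, 73.0]
full_ft_runs = [60.0, 61.0, 59.0, 64.0]

t_stat, dof = paired_t_statistic(ssu_runs, full_ft_runs)
ssu_std = statistics.stdev(ssu_runs)
```

Pairing by seed (or by language) is the design choice that matters here: it removes run-to-run variation shared by both methods, so the test looks only at the per-pair differences.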
Circularity Check
No circularity: empirical method validated on held-out benchmarks
The paper introduces Source-Shielded Updates (SSU) as an empirical technique that scores parameter importance on a small source corpus and applies column-wise freezing before target-language adaptation. All performance claims (e.g., 3.4%/2.8% source degradation vs. 20.3%/22.3% for full fine-tuning) rest on direct experimental measurements against held-out monolingual source tasks and target-language benchmarks across five languages and two model sizes. No mathematical derivations, predictions, or uniqueness claims appear that reduce by construction to fitted quantities or self-citations; the central results are externally falsifiable via the reported benchmark comparisons and do not rely on any load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- importance threshold or fraction of parameters to freeze
axioms (1)
- domain assumption: Parameter importance scores derived from small source data accurately reflect the parameters critical to source task performance
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: SSU identifies parameters critical to maintaining source abilities... applies a column-wise freezing strategy... $s_{ij} = |\theta_{ij}| \cdot \|X_j\|_2$... top k% to freeze (50% by default)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
  Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.
Reference graph
Works this paper leans on
-
[1]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. Do all languages cost the same? tokenization in the era of commercial language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 9904--9923, Singap...
-
[3]
Mitigating catastrophic forgetting in language transfer via model merging
Anton Alexandrov, Veselin Raychev, Mark Niklas M \"u ller, Ce Zhang, Martin Vechev, and Kristina Toutanova. Mitigating catastrophic forgetting in language transfer via model merging. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 17167--17186, Miami, Florida, USA, No...
-
[4]
Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III, pp.\ 144–161, Berlin, Heidelberg, 2018. Springer-Verlag. ISBN 978-3-030-01218-2. doi:10.10...
-
[5]
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, and others. PyTorch 2 : Faster machine learning through dynamic P ython byte...
-
[6]
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants , url =
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of...
-
[7]
Lo RA learns less and forgets less
Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. Lo RA learns less and forgets less. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=aloEru2qCG....
work page 2024
-
[8]
Terra Blevins, Tomasz Limisiewicz, Suchin Gururangan, Margaret Li, Hila Gonen, Noah A. Smith, and Luke Zettlemoyer. Breaking the curse of multilinguality with cross-lingual expert language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 10822-...
-
[9]
Cendol: Open instruction-tuned generative large language models for I ndonesian languages
Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Putri, Wawan Cenggoro, Jhonson Lee, Salsabil Akbar, Emmanuel Dave, Nuurshadieq Nuurshadieq, Muhammad Mahendra, Rr Putri, Bryan Wilie, Genta Winata, Alham Aji, Ayu Purwarianti, and Pascale Fung. Cendol: Open instruction-tuned generative large language models for I ndonesian languages. In Lun-Wei Ku, Andre...
-
[10]
Recall and learn: Fine-tuning deep pretrained language models with less forgetting
Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 7870--7881, Online, November 20...
-
[11]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[12]
Efficient and effective text encoding for chinese llama and alpaca,
Yiming Cui, Ziqing Yang, and Xin Yao. Efficient and effective text encoding for C hinese LLaMA and A lpaca. arXiv, abs/2304.08177, 2024. URL https://arxiv.org/abs/2304.08177
-
[13]
FLOR : On the effectiveness of language adaptation
Severino Da Dalt, Joan Llop, Irene Baucells, Marc Pamies, Yishi Xu, Aitor Gonzalez-Agirre, and Marta Villegas. FLOR : On the effectiveness of language adaptation. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistic...
work page 2024
-
[14]
FlashAttention-2 : Faster attention with better parallelism and work partitioning
Tri Dao. FlashAttention-2 : Faster attention with better parallelism and work partitioning. In Proceedings of the Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec
work page 2024
-
[15]
Episodic memory in lifelong language learning
Cyprien de Masson d Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. Episodic memory in lifelong language learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alch\' e -Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_...
work page 2019
-
[16]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and others. DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv, abs/2501.12948, 2025. URL ht...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Length-controlled AlpacaEval : A simple debiasing of automatic evaluators
Yann Dubois, Percy Liang, and Tatsunori Hashimoto. Length-controlled AlpacaEval : A simple debiasing of automatic evaluators. In Proceedings of the First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=CybBmzWBX0
work page 2024
-
[18]
Emergent abilities of large language models under continued pre-training for language adaptation
Ahmed Elhady, Eneko Agirre, and Mikel Artetxe. Emergent abilities of large language models under continued pre-training for language adaptation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 32174--...
-
[19]
LightEval : A lightweight framework for LLM evaluation
Clémentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. LightEval : A lightweight framework for LLM evaluation. https://github.com/huggingface/lighteval, 2023
work page 2023
-
[20]
The lottery ticket hypothesis: Finding sparse, trainable neural networks
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Proceedings of the Seventh International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7
work page 2019
-
[21]
S parse GPT : Massive language models can be accurately pruned in one-shot
Elias Frantar and Dan Alistarh. S parse GPT : Massive language models can be accurately pruned in one-shot. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.\ 1032...
work page 2023
-
[22]
On the effectiveness of parameter-efficient fine-tuning
Zihao Fu, Haoran Yang, Anthony Man-Cho So, Wai Lam, Lidong Bing, and Nigel Collier. On the effectiveness of parameter-efficient fine-tuning. Proceedings of the AAAI Conference on Artificial Intelligence, 37 0 (11): 0 12799--12807, Jun. 2023. doi:10.1609/aaai.v37i11.26505. URL https://ojs.aaai.org/index.php/AAAI/article/view/26505
-
[23]
Continual pre-training for cross-lingual LLM adaptation: Enhancing J apanese language capabilities
Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. Continual pre-training for cross-lingual LLM adaptation: Enhancing J apanese language capabilities. In Proceedings of the First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=TQdd1VhWbe
work page 2024
-
[24]
Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and others. A framework for few-shot language model evaluation. https:/...
-
[25]
Gemma Team , Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and others. Gemma 3 technical report. arXiv, abs/2503...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv, abs/1312.6211, 2015. URL https://arxiv.org/abs/1312.6211
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[27]
Stephen Grossberg. Studies of mind and brain: neural principles of learning, perception, development, cognition, and motor control. Boston studies in the philosophy of science; 70. D. Reidel Publishing Company, 1982. ISBN 9027713596
work page 1982
-
[28]
Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M
Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. XL -sum: Large-scale multilingual abstractive summarization for 44 languages. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, ...
-
[29]
SMT : Fine-tuning large language models with sparse matrices
Haoze He, Juncheng B Li, Xuan Jiang, and Heather Miller. SMT : Fine-tuning large language models with sparse matrices. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=GbgCRJedQ7
work page 2025
-
[30]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In Proceedings of the Nineth International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ
work page 2021
-
[31]
Parameter-efficient transfer learning for NLP
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP . In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Lear...
work page 2019
-
[32]
Lo RA : Low-rank adaptation of large language models
Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA : Low-rank adaptation of large language models. In Proceedings of the Tenth International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9
work page 2022
-
[33]
EMR-Merging : Tuning-free high-performance model merging
Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. EMR-Merging : Tuning-free high-performance model merging. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp.\ 122741--122769. Curran Associates, Inc., 2024 a . URL https://proc...
work page 2024
-
[34]
Haoyang Huang, Tianyi Tang, Dongdong Zhang, Xin Zhao, Ting Song, Yan Xia, and Furu Wei. Not all languages are created equal in LLM s: Improving multilingual capability by cross-lingual-thought prompting. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 12365--12394, Singapore,...
-
[35]
Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal
Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1...
-
[36]
Shih-Cheng Huang, Pin-Zu Li, Yu-chi Hsu, Kuang-Ming Chen, Yu Tung Lin, Shih-Kai Hsiao, Richard Tsai, and Hung-yi Lee. Chat vector: A simple approach to equip LLM s with instruction following and model alignment in new languages. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computati...
-
[37]
HFT : Half fine-tuning for large language models
Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Weiran Xu, Yu Sun, and Hua Wu. HFT : Half fine-tuning for large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 12791--12819, Vienna, Austri...
-
[38]
Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, and Barry Haddow. EMMA-500 : Enhancing massively multilingual adaptation of large language models. arXiv, abs/2409.17892, 2025. URL https://arxiv.org/abs/2409.17892
-
[39]
Continual learning with node-importance based adaptive group sparse regularization
Sangwon Jung, Hongjoon Ahn, Sungmin Cha, and Taesup Moon. Continual learning with node-importance based adaptive group sparse regularization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 3647--3658. Curran Associates, Inc., 2020. URL https://proceedings.neurips...
work page 2020
-
[40]
G lot LID : Language identification for low-resource languages
Amir Hossein Kargaran, Ayyoob Imani, Fran c ois Yvon, and Hinrich Schuetze. G lot LID : Language identification for low-resource languages. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 6155--6218, Singapore, December 2023. Association for Computational Linguistics. doi:10....
-
[41]
Continual learning of a mixed sequence of similar and dissimilar tasks
Zixuan Ke, Bing Liu, and Xingchang Huang. Continual learning of a mixed sequence of similar and dissimilar tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 18493--18504. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020...
work page 2020
-
[42]
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): ...
-
[43]
Parameter-level soft-masking for continual learning
Tatsuya Konishi, Mori Kurokawa, Chihiro Ono, Zixuan Ke, Gyuhak Kim, and Bing Liu. Parameter-level soft-masking for continual learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine L...
work page 2023
-
[44]
MADLAD -400: A multilingual and document-level large audited dataset
Sneha Kudugunta, Isaac Rayburn Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. MADLAD -400: A multilingual and document-level large audited dataset. In Proceedings of the Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openrevie...
work page 2023
-
[45]
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, and others. Tulu 3: Pushing frontiers in open language mo...
work page 2025
-
[46]
Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario S a s ko, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, and others. Datasets: A community library for natural...
-
[47]
Evolving subnetwork training for large language models
Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, and Kai Yu. Evolving subnetwork training for large language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Lear...
work page 2024
-
[48]
Enhancing large language model performance with gradient-based parameter selection
Haoling Li, Xin Zhang, Xiao Liu, Yeyun Gong, Yifan Wang, Qi Chen, and Peng Cheng. Enhancing large language model performance with gradient-based parameter selection. Proceedings of the AAAI Conference on Artificial Intelligence, 39 0 (23): 0 24431--24439, Apr. 2025. doi:10.1609/aaai.v39i23.34621. URL https://ojs.aaai.org/index.php/AAAI/article/view/34621
-
[49]
Smart FRZ : An efficient training framework using attention-based layer freezing
Sheng Li, Geng Yuan, Yue Dai, Youtao Zhang, Yanzhi Wang, and Xulong Tang. Smart FRZ : An efficient training framework using attention-based layer freezing. In Proceedings of the Eleventh International Conference on Learning Representations, 2023 a . URL https://openreview.net/forum?id=i9UlAr1T_xl
work page 2023
- [50]
-
[51]
AutoFreeze : Automatically freezing model blocks to accelerate fine-tuning
Yuhan Liu, Saurabh Agarwal, and Shivaram Venkataraman. AutoFreeze : Automatically freezing model blocks to accelerate fine-tuning. arXiv, abs/2102.01386, 2021. URL https://arxiv.org/abs/2102.01386
-
[52]
On surgical fine-tuning for language encoders
Abhilasha Lodha, Gayatri Belapurkar, Saloni Chalkapurkar, Yuanming Tao, Reshmi Ghosh, Samyadeep Basu, Dmitrii Petrov, and Soundararajan Srinivasan. On surgical fine-tuning for language encoders. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 3105--3113, Singapore, December 2...
-
[53]
Sparsity-accelerated training for large language models
Da Ma, Lu Chen, Pengyu Wang, Hongshen Xu, Hanqi Li, Liangtai Sun, Su Zhu, Shuai Fan, and Kai Yu. Sparsity-accelerated training for large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 14696--14707, Bangkok, Thailand, August 2024. Association for Computatio...
-
[54]
PackNet: Adding multiple tasks to a single network by iterative pruning
Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7765--7773, 2018. doi:10.1109/CVPR.2018.00810
-
[55]
Piggyback: Adapting a single network to multiple tasks by learning to mask weights
Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV, pp. 72--88, Berlin, Heidelberg, 2018. Springer-Verlag. ISBN 978-3-030-01224-3. doi:10.1007/978-3-030-012...
-
[56]
PEFT: State-of-the-art parameter-efficient fine-tuning methods
Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022
-
[57]
An empirical comparison of vocabulary expansion and initialization approaches for language models
Nandini Mundra, Aditya Nanda Kishore Khandavally, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, and Mitesh M Khapra. An empirical comparison of vocabulary expansion and initialization approaches for language models. In Libby Barak and Malihe Alikhani (eds.), Proceedings of the 28th Conference on Computational Natural Language Learning, pp. 84--104, M...
-
[58]
Efficient continual pre-training of LLMs for low-resource languages
Arijit Nag, Soumen Chakrabarti, Animesh Mukherjee, and Niloy Ganguly. Efficient continual pre-training of LLMs for low-resource languages. In Weizhu Chen, Yi Yang, Mohammad Kachuee, and Xue-Yong Fu (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologie...
-
[59]
SeaLLMs - large language models for Southeast Asia
Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. SeaLLMs - large language models for Southeast Asia. In Yixin Cao, Yang Feng, and Deyi Xiong (eds.), Proceedings of the 62nd ...
-
[60]
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, and others. No language left behind: Scaling human-centered mach...
-
[61]
OpenAI. GPT-5 system card. https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf, 2025
-
[62]
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, and others. GPT-4 technical report. arXiv, abs/2303.08774, 2024. URL ht...
-
[63]
Continually adding new languages to multilingual language models
Abraham Toluwase Owodunni and Sachin Kumar. Continually adding new languages to multilingual language models. arXiv, abs/2509.11414, 2025. URL https://arxiv.org/abs/2509.11414
-
[64]
Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning
Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 57018--57049. Curran As...
-
[65]
Lottery ticket adaptation: Mitigating destructive interference in LLMs
Ashwinee Panda, Berivan Isik, Xiangyu Qi, Sanmi Koyejo, Tsachy Weissman, and Prateek Mittal. Lottery ticket adaptation: Mitigating destructive interference in LLMs. arXiv, abs/2406.16797, 2024. URL https://arxiv.org/abs/2406.16797
-
[66]
chrF++: words helping character n-grams
Maja Popović. chrF++: words helping character n-grams. In Ondřej Bojar, Christian Buck, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, and Julia Kreutzer (eds.), Proceedings of the Second Conference on Machine Translation, pp. 612--618, Copenhagen, Denmark, September 2017. A...
-
[67]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 53728--53741. Curran Associa...
-
[68]
Experience replay for continual learning
David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_f...
-
[69]
Overcoming catastrophic forgetting with hard attention to the task
Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4548--4557. PMLR, 10--15 Jul 2018. URL https://pro...
-
[70]
Very deep convolutional networks for large-scale image recognition
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the Third International Conference on Learning Representations, pp. 1--14, 2015. URL https://arxiv.org/abs/1409.1556
-
[71]
Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation
Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, and others. Global MMLU: Unders...
-
[72]
A simple and effective pruning approach for large language models
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In Proceedings of the Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PxoFut3dWW
-
[73]
Unlocking the potential of model merging for low-resource languages
Mingxu Tao, Chen Zhang, Quzhe Huang, Tianyao Ma, Songfang Huang, Dongyan Zhao, and Yansong Feng. Unlocking the potential of model merging for low-resource languages. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 8705--8720, Miami, Florida, USA, November 2024. Associ...
- [74]
-
[75]
Exploring design choices for building language-specific LLMs
Atula Tejaswi, Nilesh Gupta, and Eunsol Choi. Exploring design choices for building language-specific LLMs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10485--10500, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/20...
-
[76]
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5797--5808, Florence, It...
-
[77]
2 OLMo 2 furious (COLM's version)
Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, and others. 2 OLMo 2 furious (COLM's version). In Proceedings of the...
-
[78]
Task difficulty aware parameter allocation & regularization for lifelong learning
Wenjin Wang, Yunqing Hu, Qianglong Chen, and Yin Zhang. Task difficulty aware parameter allocation & regularization for lifelong learning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7776--7785, 2023. doi:10.1109/CVPR52729.2023.00751
-
[79]
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In Proceedings of the Tenth International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR
-
[80]
On the impact of calibration data in post-training quantization and pruning
Miles Williams and Nikolaos Aletras. On the impact of calibration data in post-training quantization and pruning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10100--10118, Bangkok, Thailand, August 2024. Association for Comput...