arxiv: 2512.04844 · v2 · submitted 2025-12-04 · 💻 cs.CL · cs.AI

Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates

Pith reviewed 2026-05-17 01:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords catastrophic forgetting · LLM adaptation · language models · parameter freezing · low-resource learning · multilingual models · continual learning · instruct tuning

The pith

By using a small source dataset to identify critical parameters and freezing them before adaptation, LLMs can adapt to new languages with only unlabeled target text while losing under 4 percent of source-task performance on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles catastrophic forgetting when adapting instruct LLMs to new languages in a low-resource setting that supplies only unlabeled target data. It introduces Source-Shielded Updates, which scores parameter importance on a small source dataset and then applies column-wise freezing to protect the most vital parameters before adaptation begins. This selective strategy cuts average degradation on source monolingual tasks to 3.4 percent for 7B models and 2.8 percent for 13B models, compared with roughly 20 percent drops under full fine-tuning. Target-language results remain competitive with or better than full fine-tuning across five typologically diverse languages and multiple benchmarks. A reader would care because the approach makes expanding linguistic coverage practical without sacrificing existing capabilities.

Core claim

The paper claims that scoring parameter importance on a small source dataset and then freezing the highest-scoring parameters in a column-wise manner before fine-tuning on unlabeled target data allows LLMs to retain source abilities with only 3.4 percent average degradation for 7B models and 2.8 percent for 13B models, versus 20.3 percent and 22.3 percent under full fine-tuning, while delivering target performance that matches or exceeds full fine-tuning on most benchmarks.
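
The headline figures can also be read as a relative reduction in forgetting. The short calculation below uses only the four percentages quoted above; the helper name `relative_reduction` is ours, and the resulting ratio is an editorial restatement, not a number the paper reports.

```python
def relative_reduction(baseline_drop, method_drop):
    """Fraction of the baseline degradation that the method avoids."""
    return (baseline_drop - method_drop) / baseline_drop

# Average source-task degradation (percent) as quoted in the core claim.
reported = {
    "7B": {"full_ft": 20.3, "ssu": 3.4},
    "13B": {"full_ft": 22.3, "ssu": 2.8},
}

reductions = {
    size: relative_reduction(vals["full_ft"], vals["ssu"])
    for size, vals in reported.items()
}

for size, frac in reductions.items():
    print(f"{size}: SSU avoids about {frac:.1%} of the full fine-tuning degradation")
```

On these averages, SSU avoids roughly 83 percent (7B) and 87 percent (13B) of the degradation seen under full fine-tuning.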

What carries the argument

Source-Shielded Updates (SSU), which scores parameters for source importance on limited source data and then freezes critical columns before target adaptation.
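
A minimal sketch of how column-wise freezing could work, assuming per-parameter importance scores are already available. Summing scores within each column, the plain SGD update, and the names `column_freeze_mask` and `masked_sgd_step` are illustrative assumptions; the paper's actual scoring function and aggregation rule may differ.

```python
# Illustrative sketch only: column aggregation by sum and these helper
# names are assumptions, not the paper's implementation.

def column_freeze_mask(importance, freeze_ratio):
    """Return one bool per column: True means the column is frozen."""
    n_cols = len(importance[0])
    # Aggregate element-wise importance into a single score per column.
    col_scores = [sum(row[j] for row in importance) for j in range(n_cols)]
    n_freeze = int(round(freeze_ratio * n_cols))
    # Columns with the highest aggregate importance are shielded.
    ranked = sorted(range(n_cols), key=lambda j: col_scores[j], reverse=True)
    frozen = set(ranked[:n_freeze])
    return [j in frozen for j in range(n_cols)]

def masked_sgd_step(weights, grads, frozen_cols, lr):
    """Apply a plain SGD step, leaving every frozen column untouched."""
    return [
        [w if frozen_cols[j] else w - lr * g
         for j, (w, g) in enumerate(zip(w_row, g_row))]
        for w_row, g_row in zip(weights, grads)
    ]

# Toy 2x3 weight matrix with made-up importance scores.
importance = [[0.9, 0.1, 0.4],
              [0.8, 0.2, 0.3]]
mask = column_freeze_mask(importance, freeze_ratio=1 / 3)

weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
grads = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
updated = masked_sgd_step(weights, grads, mask, lr=0.1)
```

In this toy case the first column carries the highest aggregate score, so it is frozen and keeps its original values while the other two columns move with the gradient.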

If this is right

  • SSU enables adaptation to new languages using only unlabeled target data without labeled supervision.
  • Target-language performance equals or surpasses full fine-tuning on all 7B benchmarks and most 13B benchmarks.
  • The method maintains effectiveness across five typologically diverse languages.
  • Source performance drops remain below 4 percent on average for both 7B and 13B models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selective-freezing idea could apply to domain adaptation or other continual-learning problems beyond language expansion.
  • Refining the importance-scoring step might shrink the remaining 2.8 to 3.4 percent average degradation even further.
  • Scaling the approach to models larger than 13B would test whether the protection effect holds at greater sizes.

Load-bearing premise

The parameter importance scoring method applied to a small set of source data reliably identifies the parameters most critical to preserving source abilities across the full range of downstream tasks.
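
One way to probe this premise, sketched here as an editorial illustration and not as anything the paper reports, is to score importance on two disjoint source subsets and compare which columns each would freeze. The per-column scores below are made-up placeholders, and the helper names are hypothetical.

```python
def top_k_columns(col_scores, k):
    """Indices of the k columns with the highest importance score."""
    ranked = sorted(range(len(col_scores)), key=lambda j: col_scores[j], reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard overlap of two index sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Placeholder column scores from two hypothetical disjoint source subsets.
scores_subset_a = [0.9, 0.1, 0.7, 0.2, 0.6]
scores_subset_b = [0.8, 0.3, 0.2, 0.1, 0.7]

k = 3
frozen_a = top_k_columns(scores_subset_a, k)
frozen_b = top_k_columns(scores_subset_b, k)
overlap = jaccard(frozen_a, frozen_b)
print(f"frozen-column overlap: {overlap:.2f}")
```

A low overlap would suggest the frozen set depends heavily on which source examples were sampled, which is the failure mode the premise rules out.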

What would settle it

Applying SSU to a new set of target languages or tasks and measuring average source-task degradation above 10 percent would challenge the central claim.
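
The check described above can be sketched as follows. It assumes degradation is measured as the relative drop from the unadapted model's score on each source task, which is our reading and not a definition confirmed by the page; the task names and scores are placeholders, not results.

```python
def relative_degradation(base_score, adapted_score):
    """Percent drop relative to the unadapted model's score."""
    return 100.0 * (base_score - adapted_score) / base_score

def mean_degradation(base_scores, adapted_scores):
    """Average percent drop across all source tasks in base_scores."""
    drops = [
        relative_degradation(base_scores[task], adapted_scores[task])
        for task in base_scores
    ]
    return sum(drops) / len(drops)

# Placeholder scores for two hypothetical source tasks.
base = {"task_a": 50.0, "task_b": 80.0}
adapted = {"task_a": 48.0, "task_b": 78.0}

THRESHOLD = 10.0  # percent, the bar named in the text above
avg = mean_degradation(base, adapted)
challenges_claim = avg > THRESHOLD
print(f"average degradation: {avg:.2f}% (challenges claim: {challenges_claim})")
```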

Figures

Figures reproduced from arXiv: 2512.04844 by Aline Villavicencio, Atsuki Yamaguchi, Nikolaos Aletras, Terufumi Morishita.

Figure 1: Overview of Source-Shielded Update (SSU). The method comprises three stages: importance scoring, column-wise mask generation, and continual pre-training on unlabeled target language data with the masks.
Figure 2: Model performance (SSU-Wanda, HFT, GMT) on Igbo as target language across freezing ...
original abstract

Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Source-Shielded Updates (SSU) to adapt instruct LLMs to target languages using only unlabeled target data while mitigating catastrophic forgetting of source abilities. SSU computes parameter importance scores from a small source dataset and applies column-wise freezing of critical parameters before performing adaptation. Experiments on 7B and 13B models across five typologically diverse languages report that SSU limits average degradation on monolingual source tasks to 3.4% (7B) and 2.8% (13B), versus 20.3% and 22.3% for full fine-tuning, while achieving target-language performance that is competitive with or superior to full fine-tuning.

Significance. If the empirical results hold under broader validation, SSU offers a practical, low-resource technique for expanding LLM linguistic coverage without requiring labeled target data or suffering severe source forgetting. The selective freezing approach is a clear strength and could transfer to other continual-learning or domain-adaptation settings. The manuscript earns credit for consistent quantitative comparisons across model sizes and languages with explicit baseline contrasts.

major comments (2)
  1. [§3] §3 (Method): The parameter-importance scoring procedure is central to the headline claim yet relies on a small source corpus whose selection and coverage are not fully detailed. This concern is load-bearing: if the scoring (gradient magnitude, Fisher, or similar) is dominated by the particular examples chosen, parameters critical to untested source behaviors (reasoning chains, long-context coherence) may remain unfrozen and be overwritten. The reported source-task results may overlap with the scoring distribution, so they do not fully test generalization.
  2. [Experiments] Experiments section / Table reporting source degradation: Average degradations of 3.4% (7B) and 2.8% (13B) are presented without per-run variance, statistical significance tests, or results on held-out source tasks disjoint from the importance-scoring data. This weakens confidence that the protection generalizes beyond the scoring distribution, directly affecting the central claim of reliable source preservation.
minor comments (2)
  1. [Abstract] The abstract states results for 'five typologically diverse languages' but does not name them; adding the language list would improve immediate readability.
  2. [§4] Hyperparameter choices for the importance threshold or fraction of parameters frozen are mentioned as free parameters but lack explicit values or sensitivity analysis in the main text; moving these to a dedicated paragraph or appendix would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we plan to incorporate.

point-by-point responses
  1. Referee: [§3] §3 (Method): The parameter-importance scoring procedure is central to the headline claim yet relies on a small source corpus whose selection and coverage are not fully detailed. This concern is load-bearing: if the scoring (gradient magnitude, Fisher, or similar) is dominated by the particular examples chosen, parameters critical to untested source behaviors (reasoning chains, long-context coherence) may remain unfrozen and be overwritten. The reported source-task results may overlap with the scoring distribution, so they do not fully test generalization.

    Authors: We appreciate the referee's emphasis on the importance of detailing the source corpus for parameter scoring. In the revised manuscript, we will expand Section 3 with additional specifics on the source data: its approximate size, selection criteria (sampling from diverse instruction-following and reasoning examples to cover core source abilities), and the exact scoring method (gradient magnitude). We will also clarify that the small scoring set targets general parameter importance rather than being tied to specific evaluation examples, and add a limitations discussion acknowledging that while our source-task benchmarks test a range of behaviors including reasoning and coherence, exhaustive coverage of all possible source capabilities remains an inherent challenge in this low-resource setup. revision: yes

  2. Referee: Experiments section / Table reporting source degradation: Average degradations of 3.4% (7B) and 2.8% (13B) are presented without per-run variance, statistical significance tests, or results on held-out source tasks disjoint from the importance-scoring data. This weakens confidence that the protection generalizes beyond the scoring distribution, directly affecting the central claim of reliable source preservation.

    Authors: We agree that reporting variance and statistical tests would increase confidence in the results. We will revise the experiments section and tables to include per-run standard deviations and appropriate statistical significance tests (e.g., paired t-tests) comparing SSU against full fine-tuning. Regarding held-out source tasks, the evaluation benchmarks used are standard monolingual tasks that are disjoint from the minimal scoring examples; we will explicitly state this separation in the revised text and add a brief discussion of generalization. If further disjoint evaluations prove necessary, we will note this as future work. revision: partial
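
For readers unfamiliar with the proposed reporting, here is a minimal sketch of per-run standard deviations and a paired t-statistic over matched runs, using only the Python standard library. The per-seed scores are placeholders, not results from the paper, and the function name is ours.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for matched per-run scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Placeholder source-task scores over four matched seeds.
ssu_runs = [70.0, 71.0, 69.0, 72.0]
full_ft_runs = [60.0, 61.0, 59.0, 58.0]

print("SSU std:", round(statistics.stdev(ssu_runs), 3))
print("Full FT std:", round(statistics.stdev(full_ft_runs), 3))

t_stat, dof = paired_t_statistic(ssu_runs, full_ft_runs)
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value requires the t-distribution, which the standard library does not provide; `scipy.stats.ttest_rel` is one existing option if SciPy is available.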

Circularity Check

0 steps flagged

No circularity: empirical method validated on held-out benchmarks

full rationale

The paper introduces Source-Shielded Updates (SSU) as an empirical technique that scores parameter importance on a small source corpus and applies column-wise freezing before target-language adaptation. All performance claims (e.g., 3.4%/2.8% source degradation vs. 20.3%/22.3% for full fine-tuning) rest on direct experimental measurements against held-out monolingual source tasks and target-language benchmarks across five languages and two model sizes. No mathematical derivations, predictions, or uniqueness claims appear that reduce by construction to fitted quantities or self-citations; the central results are externally falsifiable via the reported benchmark comparisons and do not rely on any load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that parameter importance estimated from limited source examples generalizes to the full source capability set; no new physical entities or mathematical axioms are introduced.

free parameters (1)
  • importance threshold or fraction of parameters to freeze
    Controls the trade-off between source preservation and target adaptation; value chosen to achieve reported results.
axioms (1)
  • domain assumption: Parameter importance scores derived from small source data accurately reflect parameters critical to source task performance
    Invoked when selecting which columns to freeze before adaptation.
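
To make the free parameter concrete, the sketch below shows how a freezing ratio would translate into trainable parameters for a single weight matrix under column-wise freezing. The 4096 by 4096 shape, the ratios swept, and the rounding rule are illustrative assumptions; the page does not state the values the paper uses.

```python
def trainable_parameters(n_rows, n_cols, freeze_ratio):
    """Parameters left trainable when a fraction of columns is frozen."""
    n_frozen_cols = int(round(freeze_ratio * n_cols))
    return n_rows * (n_cols - n_frozen_cols)

# Hypothetical square projection matrix; the shape is an assumption.
N_ROWS, N_COLS = 4096, 4096

sweep = {
    ratio: trainable_parameters(N_ROWS, N_COLS, ratio)
    for ratio in (0.0, 0.25, 0.5, 0.75)
}

for ratio, count in sweep.items():
    share = count / (N_ROWS * N_COLS)
    print(f"freeze_ratio={ratio:.2f}: {count} trainable ({share:.0%})")
```

A sensitivity analysis of the kind the referee asks for would repeat adaptation at each ratio and report source and target scores side by side.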

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

    cs.CL 2026-05 unverdicted novelty 6.0

    Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

    Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. Do all languages cost the same? tokenization in the era of commercial language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 9904--9923, Singap...

  3. [3]

    Mitigating catastrophic forgetting in language transfer via model merging

    Anton Alexandrov, Veselin Raychev, Mark Niklas M \"u ller, Ce Zhang, Martin Vechev, and Kristina Toutanova. Mitigating catastrophic forgetting in language transfer via model merging. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 17167--17186, Miami, Florida, USA, No...

  4. [4]

    Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III , pages =

    Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III, pp.\ 144–161, Berlin, Heidelberg, 2018. Springer-Verlag. ISBN 978-3-030-01218-2. doi:10.10...

  5. [5]

    PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transforma- tion and Graph Compilation

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, and others. PyTorch 2 : Faster machine learning through dynamic P ython byte...

  6. [6]

    The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants , url =

    Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of...

  7. [7]

    Lo RA learns less and forgets less

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. Lo RA learns less and forgets less. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=aloEru2qCG....

  8. [8]

    Smith, and Luke Zettlemoyer

    Terra Blevins, Tomasz Limisiewicz, Suchin Gururangan, Margaret Li, Hila Gonen, Noah A. Smith, and Luke Zettlemoyer. Breaking the curse of multilinguality with cross-lingual expert language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 10822-...

  9. [9]

    Cendol: Open instruction-tuned generative large language models for I ndonesian languages

    Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Putri, Wawan Cenggoro, Jhonson Lee, Salsabil Akbar, Emmanuel Dave, Nuurshadieq Nuurshadieq, Muhammad Mahendra, Rr Putri, Bryan Wilie, Genta Winata, Alham Aji, Ayu Purwarianti, and Pascale Fung. Cendol: Open instruction-tuned generative large language models for I ndonesian languages. In Lun-Wei Ku, Andre...

  10. [10]

    Recall and learn: Fine-tuning deep pretrained language models with less forgetting

    Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 7870--7881, Online, November 20...

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

  12. [12]

    Efficient and effective text encoding for chinese llama and alpaca,

    Yiming Cui, Ziqing Yang, and Xin Yao. Efficient and effective text encoding for C hinese LLaMA and A lpaca. arXiv, abs/2304.08177, 2024. URL https://arxiv.org/abs/2304.08177

  13. [13]

    FLOR : On the effectiveness of language adaptation

    Severino Da Dalt, Joan Llop, Irene Baucells, Marc Pamies, Yishi Xu, Aitor Gonzalez-Agirre, and Marta Villegas. FLOR : On the effectiveness of language adaptation. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistic...

  14. [14]

    FlashAttention-2 : Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2 : Faster attention with better parallelism and work partitioning. In Proceedings of the Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec

  15. [15]

    Episodic memory in lifelong language learning

    Cyprien de Masson d Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. Episodic memory in lifelong language learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alch\' e -Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_...

  16. [16]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and others. DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv, abs/2501.12948, 2025. URL ht...

  17. [17]

    Length-controlled AlpacaEval : A simple debiasing of automatic evaluators

    Yann Dubois, Percy Liang, and Tatsunori Hashimoto. Length-controlled AlpacaEval : A simple debiasing of automatic evaluators. In Proceedings of the First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=CybBmzWBX0

  18. [18]

    Emergent abilities of large language models under continued pre-training for language adaptation

    Ahmed Elhady, Eneko Agirre, and Mikel Artetxe. Emergent abilities of large language models under continued pre-training for language adaptation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 32174--...

  19. [19]

    LightEval : A lightweight framework for LLM evaluation

    Clémentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. LightEval : A lightweight framework for LLM evaluation. https://github.com/huggingface/lighteval, 2023

  20. [20]

    The lottery ticket hypothesis: Finding sparse, trainable neural networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Proceedings of the Seventh International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7

  21. [21]

    S parse GPT : Massive language models can be accurately pruned in one-shot

    Elias Frantar and Dan Alistarh. S parse GPT : Massive language models can be accurately pruned in one-shot. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.\ 1032...

  22. [22]

    On the effectiveness of parameter-efficient fine-tuning

    Zihao Fu, Haoran Yang, Anthony Man-Cho So, Wai Lam, Lidong Bing, and Nigel Collier. On the effectiveness of parameter-efficient fine-tuning. Proceedings of the AAAI Conference on Artificial Intelligence, 37 0 (11): 0 12799--12807, Jun. 2023. doi:10.1609/aaai.v37i11.26505. URL https://ojs.aaai.org/index.php/AAAI/article/view/26505

  23. [23]

    Continual pre-training for cross-lingual LLM adaptation: Enhancing J apanese language capabilities

    Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. Continual pre-training for cross-lingual LLM adaptation: Enhancing J apanese language capabilities. In Proceedings of the First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=TQdd1VhWbe

  24. [24]

    Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and others. A framework for few-shot language model evaluation. https:/...

  25. [25]

    Gemma 3 Technical Report

    Gemma Team , Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and others. Gemma 3 technical report. arXiv, abs/2503...

  26. [26]

    An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

    Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv, abs/1312.6211, 2015. URL https://arxiv.org/abs/1312.6211

  27. [27]

    Studies of mind and brain: neural principles of learning, perception, development, cognition, and motor control

    Stephen Grossberg. Studies of mind and brain: neural principles of learning, perception, development, cognition, and motor control. Boston studies in the philosophy of science; 70. D. Reidel Publishing Company, 1982. ISBN 9027713596

  28. [28]

    Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M

    Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. XL -sum: Large-scale multilingual abstractive summarization for 44 languages. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, ...

  29. [29]

    SMT : Fine-tuning large language models with sparse matrices

    Haoze He, Juncheng B Li, Xuan Jiang, and Heather Miller. SMT : Fine-tuning large language models with sparse matrices. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=GbgCRJedQ7

  30. [30]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In Proceedings of the Nineth International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

  31. [31]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP . In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Lear...

  32. [32]

    Lo RA : Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA : Low-rank adaptation of large language models. In Proceedings of the Tenth International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  33. [33]

    EMR-Merging : Tuning-free high-performance model merging

    Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. EMR-Merging : Tuning-free high-performance model merging. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp.\ 122741--122769. Curran Associates, Inc., 2024 a . URL https://proc...

  34. [34]

    Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting , booktitle =

    Haoyang Huang, Tianyi Tang, Dongdong Zhang, Xin Zhao, Ting Song, Yan Xia, and Furu Wei. Not all languages are created equal in LLM s: Improving multilingual capability by cross-lingual-thought prompting. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 12365--12394, Singapore,...

  35. [35]

    Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal

    Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1...

  36. [36]

    Chat vector: A simple approach to equip LLM s with instruction following and model alignment in new languages

    Shih-Cheng Huang, Pin-Zu Li, Yu-chi Hsu, Kuang-Ming Chen, Yu Tung Lin, Shih-Kai Hsiao, Richard Tsai, and Hung-yi Lee. Chat vector: A simple approach to equip LLM s with instruction following and model alignment in new languages. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computati...

  37. [37]

    HFT : Half fine-tuning for large language models

    Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Weiran Xu, Yu Sun, and Hua Wu. HFT : Half fine-tuning for large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 12791--12819, Vienna, Austri...

  38. [38]

    Emma-500: Enhancing massively multilingual adaptation of large language models.arXiv preprint arXiv:2409.17892, 2024

    Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, and Barry Haddow. EMMA-500 : Enhancing massively multilingual adaptation of large language models. arXiv, abs/2409.17892, 2025. URL https://arxiv.org/abs/2409.17892

  39. [39]

    Continual learning with node-importance based adaptive group sparse regularization

    Sangwon Jung, Hongjoon Ahn, Sungmin Cha, and Taesup Moon. Continual learning with node-importance based adaptive group sparse regularization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 3647--3658. Curran Associates, Inc., 2020. URL https://proceedings.neurips...

  40. [40]

    G lot LID : Language identification for low-resource languages

    Amir Hossein Kargaran, Ayyoob Imani, Fran c ois Yvon, and Hinrich Schuetze. G lot LID : Language identification for low-resource languages. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 6155--6218, Singapore, December 2023. Association for Computational Linguistics. doi:10....

  41. [41]

    Continual learning of a mixed sequence of similar and dissimilar tasks

    Zixuan Ke, Bing Liu, and Xingchang Huang. Continual learning of a mixed sequence of similar and dissimilar tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 18493--18504. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020...

  42. [42]

    doi: 10.1073/pnas.1611835114

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): ...

  43. [43]

    Parameter-level soft-masking for continual learning

    Tatsuya Konishi, Mori Kurokawa, Chihiro Ono, Zixuan Ke, Gyuhak Kim, and Bing Liu. Parameter-level soft-masking for continual learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine L...

  44. [44]

    MADLAD-400: A multilingual and document-level large audited dataset

    Sneha Kudugunta, Isaac Rayburn Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. MADLAD-400: A multilingual and document-level large audited dataset. In Proceedings of the Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openrevie...

  45. [45]

    Tulu 3: Pushing frontiers in open language mo...

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, and others. Tulu 3: Pushing frontiers in open language mo...

  46. [46]

    Datasets: A community library for natural...

    Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, and others. Datasets: A community library for natural...

  47. [47]

    Evolving subnetwork training for large language models

    Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, and Kai Yu. Evolving subnetwork training for large language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Lear...

  48. [48]

    Enhancing large language model performance with gradient-based parameter selection

    Haoling Li, Xin Zhang, Xiao Liu, Yeyun Gong, Yifan Wang, Qi Chen, and Peng Cheng. Enhancing large language model performance with gradient-based parameter selection. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23): 24431–24439, Apr. 2025. doi:10.1609/aaai.v39i23.34621. URL https://ojs.aaai.org/index.php/AAAI/article/view/34621

  49. [49]

    SmartFRZ: An efficient training framework using attention-based layer freezing

    Sheng Li, Geng Yuan, Yue Dai, Youtao Zhang, Yanzhi Wang, and Xulong Tang. SmartFRZ: An efficient training framework using attention-based layer freezing. In Proceedings of the Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=i9UlAr1T_xl

  50. [50]

    AlpacaEval: An automatic evaluator of instruction-following models

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b

  51. [51]

    AutoFreeze: Automatically freezing model blocks to accelerate fine-tuning

    Yuhan Liu, Saurabh Agarwal, and Shivaram Venkataraman. AutoFreeze: Automatically freezing model blocks to accelerate fine-tuning. arXiv, abs/2102.01386, 2021. URL https://arxiv.org/abs/2102.01386

  52. [52]

    On surgical fine-tuning for language encoders

    Abhilasha Lodha, Gayatri Belapurkar, Saloni Chalkapurkar, Yuanming Tao, Reshmi Ghosh, Samyadeep Basu, Dmitrii Petrov, and Soundararajan Srinivasan. On surgical fine-tuning for language encoders. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 3105–3113, Singapore, December 2...

  53. [53]

    Sparsity-accelerated training for large language models

    Da Ma, Lu Chen, Pengyu Wang, Hongshen Xu, Hanqi Li, Liangtai Sun, Su Zhu, Shuai Fan, and Kai Yu. Sparsity-accelerated training for large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 14696–14707, Bangkok, Thailand, August 2024. Association for Computatio...

  54. [54]

    PackNet: Adding multiple tasks to a single network by iterative pruning

    Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7765–7773, 2018. doi:10.1109/CVPR.2018.00810

  55. [55]

    Piggyback: Adapting a single network to multiple tasks by learning to mask weights

    Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV, pp. 72–88, Berlin, Heidelberg, 2018. Springer-Verlag. ISBN 978-3-030-01224-3. doi:10.1007/978-3-030-012...

  56. [56]

    PEFT: State-of-the-art parameter-efficient fine-tuning methods

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022

  57. [57]

    An empirical comparison of vocabulary expansion and initialization approaches for language models

    Nandini Mundra, Aditya Nanda Kishore Khandavally, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, and Mitesh M Khapra. An empirical comparison of vocabulary expansion and initialization approaches for language models. In Libby Barak and Malihe Alikhani (eds.), Proceedings of the 28th Conference on Computational Natural Language Learning, pp. 84–104, M...

  58. [58]

    Efficient continual pre-training of LLMs for low-resource languages

    Arijit Nag, Soumen Chakrabarti, Animesh Mukherjee, and Niloy Ganguly. Efficient continual pre-training of LLMs for low-resource languages. In Weizhu Chen, Yi Yang, Mohammad Kachuee, and Xue-Yong Fu (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologie...

  59. [59]

    SeaLLMs - large language models for Southeast Asia

    Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. SeaLLMs - large language models for Southeast Asia. In Yixin Cao, Yang Feng, and Deyi Xiong (eds.), Proceedings of the 62nd ...

  60. [60]

    NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, and others. No language left behind: Scaling human-centered mach...

  61. [61]

    GPT-5 system card

    OpenAI. GPT-5 system card. https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf, 2025

  62. [62]

    GPT-4 Technical Report

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, and others. GPT-4 technical report. arXiv, abs/2303.08774, 2024. URL ht...

  63. [63]

    Continually adding new languages to multilingual language models

    Abraham Toluwase Owodunni and Sachin Kumar. Continually adding new languages to multilingual language models. arXiv, abs/2509.11414, 2025. URL https://arxiv.org/abs/2509.11414

  64. [64]

    Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning

    Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 57018–57049. Curran As...

  65. [65]

    Lottery ticket adaptation: Mitigating destructive interference in LLMs

    Ashwinee Panda, Berivan Isik, Xiangyu Qi, Sanmi Koyejo, Tsachy Weissman, and Prateek Mittal. Lottery ticket adaptation: Mitigating destructive interference in LLMs. arXiv, abs/2406.16797, 2024. URL https://arxiv.org/abs/2406.16797

  66. [66]

    chrF++: words helping character n-grams

    Maja Popović. chrF++: words helping character n-grams. In Ondřej Bojar, Christian Buck, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, and Julia Kreutzer (eds.), Proceedings of the Second Conference on Machine Translation, pp. 612–618, Copenhagen, Denmark, September 2017. A...

  67. [67]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 53728–53741. Curran Associa...

  68. [68]

    Experience replay for continual learning

    David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_f...

  69. [69]

    Overcoming catastrophic forgetting with hard attention to the task

    Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4548–4557. PMLR, 10–15 Jul 2018. URL https://pro...

  70. [70]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the Third International Conference on Learning Representations, pp. 1–14, 2015. URL https://arxiv.org/abs/1409.1556

  71. [71]

    Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation

    Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, and others. Global MMLU: Unders...

  72. [72]

    A simple and effective pruning approach for large language models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In Proceedings of the Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PxoFut3dWW

  73. [73]

    Unlocking the potential of model merging for low-resource languages

    Mingxu Tao, Chen Zhang, Quzhe Huang, Tianyao Ma, Songfang Huang, Dongyan Zhao, and Yansong Feng. Unlocking the potential of model merging for low-resource languages. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 8705–8720, Miami, Florida, USA, November 2024. Associ...

  74. [74]

    Stanford Alpaca: An instruction-following LLaMA model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  75. [75]

    Exploring Design Choices for Building Language-Specific LLMs

    Atula Tejaswi, Nilesh Gupta, and Eunsol Choi. Exploring design choices for building language-specific LLMs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10485–10500, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/20...

  76. [76]

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5797–5808, Florence, It...

  77. [77]

    2 OLMo 2 furious (COLM's version)

    Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, and others. 2 OLMo 2 furious (COLM's version). In Proceedings of the...

  78. [78]

    Task difficulty aware parameter allocation & regularization for lifelong learning

    Wenjin Wang, Yunqing Hu, Qianglong Chen, and Yin Zhang. Task difficulty aware parameter allocation & regularization for lifelong learning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7776–7785, 2023. doi:10.1109/CVPR52729.2023.00751

  79. [79]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In Proceedings of the Tenth International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR

  80. [80]

    On the impact of calibration data in post-training quantization and pruning

    Miles Williams and Nikolaos Aletras. On the impact of calibration data in post-training quantization and pruning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10100–10118, Bangkok, Thailand, August 2024. Association for Comput...
