
arxiv: 2606.00284 · v1 · pith:KR6M7OGQnew · submitted 2026-05-29 · 💻 cs.CL

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

Pith reviewed 2026-06-28 21:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords catastrophic forgetting · continual pretraining · multilingual language models · parameter alignment · layer freezing · model merging · translation · reading comprehension

The pith

Parameter alignment strategies reduce catastrophic forgetting during multilingual continual pretraining at little cost to acquiring the new languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper links forgetting of prior capabilities in large language models to parameter drift during continual pretraining on data from new languages. To counteract that drift, it introduces five layer-aware alignment methods, including hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. These are tested against two unregularized baselines on 32 training languages from five families, plus held-out languages, along four axes: perplexity, reading comprehension, physical reasoning, and translation. Layer freezing and regularization preserve comprehension most effectively, while post-hoc reversion produces the largest translation improvements. The results yield practical pairings of each method with the tasks it serves best.
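As a rough illustration of hard layer freezing, the sketch below applies a plain SGD update to every layer except a frozen set. The toy model (a dict of flat weight lists), the layer names, and the function `sgd_step_with_freezing` are hypothetical and chosen only for illustration; the paper's actual optimizer and choice of frozen layers are not given on this page.

```python
from typing import Dict, List, Set

Params = Dict[str, List[float]]


def sgd_step_with_freezing(params: Params, grads: Params,
                           frozen: Set[str], lr: float) -> Params:
    """Return updated parameters; layers named in `frozen` are copied unchanged."""
    updated: Params = {}
    for name, weights in params.items():
        if name in frozen:
            # Hard freezing: the gradient for this layer is ignored entirely.
            updated[name] = list(weights)
        else:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
    return updated


# Toy three-layer model (hypothetical values).
params = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0], "layer2": [5.0, 6.0]}
grads = {"layer0": [1.0, 1.0], "layer1": [1.0, 1.0], "layer2": [1.0, 1.0]}
new_params = sgd_step_with_freezing(params, grads, frozen={"layer1"}, lr=0.5)
```

Because the frozen layer never moves, its drift from the base model is zero by construction, which is what distinguishes this from the soft variants.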

Core claim

Forgetting in multilingual expert language models arises from parameter drift during continual pretraining on new languages. A suite of five layer-aware alignment strategies (the abstract names hard layer freezing, soft regularization, post-hoc weight reversion, and model merging) counters this drift and substantially reduces forgetting while preserving most of the ability to acquire the target languages. On benchmarks spanning 32 training languages plus held-out ones, freezing and regularization best maintain reading comprehension and reasoning, whereas post-hoc reversion delivers the strongest translation performance.
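One common way to implement soft regularization is a squared-distance penalty that pulls selected layers back toward their base-model values. The sketch below assumes that form; the page does not state the paper's exact regularizer, and the names `alignment_penalty`, `alignment_gradient`, and `lam` are hypothetical.

```python
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]


def alignment_penalty(params: Params, base: Params,
                      aligned_layers: Iterable[str], lam: float) -> float:
    """lam * sum of squared differences from the base, over the aligned layers only."""
    total = 0.0
    for name in aligned_layers:
        total += sum((w - b) ** 2 for w, b in zip(params[name], base[name]))
    return lam * total


def alignment_gradient(params: Params, base: Params,
                       aligned_layers: Iterable[str], lam: float) -> Params:
    """Gradient of the penalty: 2 * lam * (w - b) on aligned layers, zero elsewhere."""
    aligned = set(aligned_layers)
    grads: Params = {}
    for name, weights in params.items():
        if name in aligned:
            grads[name] = [2.0 * lam * (w - b) for w, b in zip(weights, base[name])]
        else:
            grads[name] = [0.0] * len(weights)
    return grads


# Toy values (hypothetical): only the "middle" layer is regularized.
base = {"first": [0.0, 0.0], "middle": [1.0, 1.0], "last": [2.0, 2.0]}
current = {"first": [1.0, 1.0], "middle": [2.0, 3.0], "last": [2.0, 2.0]}
penalty = alignment_penalty(current, base, ["middle"], lam=0.5)
penalty_grads = alignment_gradient(current, base, ["middle"], lam=0.5)
```

In training, this penalty would be added to the language-modeling loss, so `lam` sets how strongly the aligned layers resist drift.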

What carries the argument

Layer-aware parameter alignment strategies that directly counteract parameter drift during or after family-based continual pretraining of multilingual expert models.
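To make "parameter drift" concrete, the following sketch computes a per-layer Euclidean distance between a base model and its continually pretrained counterpart. Using the L2 norm as the drift measure is an assumption made here for illustration, not a detail taken from the paper, and `layer_drift` is a hypothetical helper.

```python
import math
from typing import Dict, List

Params = Dict[str, List[float]]


def layer_drift(base: Params, tuned: Params) -> Dict[str, float]:
    """Euclidean distance between base and tuned weights, reported per layer."""
    return {
        name: math.sqrt(sum((t - b) ** 2 for b, t in zip(base[name], tuned[name])))
        for name in base
    }


# Toy values (hypothetical): layer0 moved during training, layer1 did not.
base = {"layer0": [0.0, 0.0], "layer1": [1.0, 1.0]}
tuned = {"layer0": [3.0, 4.0], "layer1": [1.0, 1.0]}
drift = layer_drift(base, tuned)
```

A per-layer view like this is what makes a strategy "layer-aware": it shows which layers moved most and are therefore candidates for freezing, regularizing, or reverting.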

If this is right

  • Layer freezing and regularization best preserve comprehension and reasoning after adding new languages.
  • Post-hoc weight reversion produces the largest gains on translation tasks.
  • These methods achieve reduced forgetting at minimal expense to acquisition of the target languages.
  • Language-family organization alone does not prevent loss of general knowledge needed for downstream tasks.
  • Deployment can pair specific alignment strategies to the performance axes that matter most for a given use case.
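Post-hoc weight reversion, as described above, can be sketched as copying base-model weights back into selected layers once training is finished. The layer names and the helper `revert_layers` are hypothetical; which layers the paper reverts is not specified on this page beyond the figure caption's reference to middle-layer parameters.

```python
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]


def revert_layers(base: Params, tuned: Params,
                  layers_to_revert: Iterable[str]) -> Params:
    """Return a copy of `tuned` whose selected layers are reset to base weights."""
    revert = set(layers_to_revert)
    unknown = revert - set(tuned)
    if unknown:
        raise KeyError(f"unknown layers: {sorted(unknown)}")
    return {
        name: list(base[name]) if name in revert else list(weights)
        for name, weights in tuned.items()
    }


# Toy values (hypothetical): revert only the middle layer after training.
base = {"first": [0.0], "middle": [1.0], "last": [2.0]}
tuned = {"first": [0.5], "middle": [9.0], "last": [2.5]}
reverted = revert_layers(base, tuned, ["middle"])
```

Since reversion needs no retraining, it can be applied to an existing unregularized checkpoint, which is why it is described as post-hoc.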

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment logic could be tested on continual pretraining that adds entirely new domains rather than languages.
  • If parameter drift is the main driver, similar lightweight alignment steps might apply to other continual-learning settings outside language models.
  • The acquisition-forgetting frontier mapped here suggests a tunable trade-off surface that future work could optimize with fewer than five discrete strategies.

Load-bearing premise

Forgetting in multilingual continual pretraining is driven by parameter drift and the chosen benchmarks across four axes adequately reflect real-world downstream performance and retained knowledge.

What would settle it

An experiment applying the alignment strategies yet observing no reduction in forgetting rates relative to the unregularized baselines on the same perplexity, comprehension, reasoning, and translation benchmarks.
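The comparison this test calls for can be sketched as a per-axis forgetting score: the base model's score minus the score after continual pretraining, with the sign flipped for lower-is-better metrics such as perplexity. The function name and every number below are placeholders invented for illustration; none are results from the paper.

```python
from typing import AbstractSet, Dict


def forgetting(base_scores: Dict[str, float], after_scores: Dict[str, float],
               lower_is_better: AbstractSet[str] = frozenset()) -> Dict[str, float]:
    """Per-axis forgetting; a positive value means the capability got worse."""
    result: Dict[str, float] = {}
    for axis, base in base_scores.items():
        delta = base - after_scores[axis]
        # For lower-is-better metrics an increase is a loss, so flip the sign.
        result[axis] = -delta if axis in lower_is_better else delta
    return result


# Placeholder scores (NOT from the paper).
base = {"reading_comprehension": 0.5, "perplexity": 10.0}
unregularized = {"reading_comprehension": 0.25, "perplexity": 14.0}
aligned = {"reading_comprehension": 0.375, "perplexity": 11.0}

baseline_forgetting = forgetting(base, unregularized, lower_is_better={"perplexity"})
aligned_forgetting = forgetting(base, aligned, lower_is_better={"perplexity"})
reduced = {axis: aligned_forgetting[axis] < baseline_forgetting[axis]
           for axis in base}
```

Under this framing, the claim would be undermined on any axis where the `reduced` entry is consistently false across languages.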

Figures

Figures reproduced from arXiv: 2606.00284 by Sanchit Ahuja, Terra Blevins.

Figure 1: Left: Overview of parameter alignment strategies. The layer-aware methods regularize or replace middle-layer parameters while allowing the other layers to learn language-specific information; Expert Soup uniformly averages the baseline Experts. Right: Summarized downstream results; parameter alignment improves reading-comprehension retention, while Dense-Reverted preserves strong translation quality. Speci…
Figure 2: Held-out Belebele accuracy delta and FLO…
Figure 3: Layer interpolation between the base model and Dense CPT. All non-interpolated layers are kept at their…
Figure 4: Held-in Belebele accuracy under first-, middle-, and last-layer interpolation, broken down by language…
Figure 5: Held-in FLORES ChrF under first-, middle-, and last-layer interpolation, broken down by language family.
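The layer interpolation named in the Figure 3 to 5 captions can be sketched as a linear blend between base and continually pretrained weights on a chosen group of layers. The Figure 3 caption is truncated before it says what the non-interpolated layers are kept at, so the sketch exposes that as a `keep_other` parameter rather than guessing; `interpolate_layers`, `alpha`, and the layer names are hypothetical.

```python
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]


def interpolate_layers(base: Params, tuned: Params, layers: Iterable[str],
                       alpha: float, keep_other: str = "tuned") -> Params:
    """Blend chosen layers as (1 - alpha) * base + alpha * tuned.

    Layers outside `layers` are copied from `base` or `tuned` according to
    `keep_other`; which one the paper uses is not recoverable from this page.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if keep_other not in ("base", "tuned"):
        raise ValueError("keep_other must be 'base' or 'tuned'")
    chosen = set(layers)
    fallback = base if keep_other == "base" else tuned
    blended: Params = {}
    for name in tuned:
        if name in chosen:
            blended[name] = [(1.0 - alpha) * b + alpha * t
                             for b, t in zip(base[name], tuned[name])]
        else:
            blended[name] = list(fallback[name])
    return blended


# Toy values (hypothetical): interpolate only the middle layer.
base = {"first": [0.0], "middle": [0.0, 2.0], "last": [1.0]}
tuned = {"first": [8.0], "middle": [4.0, 6.0], "last": [3.0]}
mixed = interpolate_layers(base, tuned, ["middle"], alpha=0.25)
```

Setting `alpha=0.0` on a layer group reproduces full reversion of that group, and `alpha=1.0` leaves it at the continually pretrained weights.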
Original abstract

While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition-forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
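The model-merging strategy appears in the Figure 1 caption as Expert Soup, which uniformly averages the baseline experts. A minimal sketch of uniform weight averaging follows, assuming all experts share the same layer names and shapes; `uniform_soup` and the toy weights are hypothetical.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def uniform_soup(experts: Sequence[Params]) -> Params:
    """Average corresponding weights across experts with equal weight per expert."""
    if not experts:
        raise ValueError("need at least one expert")
    count = len(experts)
    return {
        name: [sum(values) / count
               for values in zip(*(expert[name] for expert in experts))]
        for name in experts[0]
    }


# Toy values (hypothetical): two family experts with one shared layer.
expert_a = {"layer0": [1.0, 3.0]}
expert_b = {"layer0": [3.0, 5.0]}
soup = uniform_soup([expert_a, expert_b])
```

Unlike the layer-aware methods, this treats every layer the same way and yields a single merged model rather than one model per language family.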

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that parameter alignment mitigates catastrophic forgetting during continual pretraining (CPT) of multilingual expert language models. It links forgetting to parameter drift, introduces five layer-aware alignment strategies (hard layer freezing, soft regularization, post-hoc weight reversion, model merging, and one additional), and evaluates them against unregularized CPT baselines on benchmarks spanning 32 languages from five families plus held-out languages. The evaluation covers four axes (perplexity, reading comprehension, physical reasoning, translation) and concludes that alignment substantially reduces forgetting at minimal cost to language acquisition, with layer freezing/regularization optimal for comprehension and post-hoc reversion for translation.

Significance. If the results hold, this work maps the acquisition-forgetting frontier for family-expert CPT and supplies practical deployment guidelines that pair strategies to tasks. The systematic empirical comparison across languages, families, and multiple axes is a strength of the study.

major comments (1)
  1. [Evaluation section] Evaluation section (and abstract): The central claim that parameter alignment substantially reduces forgetting rests on the premise that the four evaluation axes sufficiently capture retention of general knowledge for downstream tasks. The manuscript provides no additional evidence or ablation showing that perplexity, reading comprehension, physical reasoning, and translation across the 32+ languages are representative rather than task-specific; this assumption is load-bearing for the generality of the reported reductions in forgetting.
minor comments (1)
  1. [Abstract] Abstract: The text states there are 'five layer-aware parameter alignment strategies' but then enumerates only four (hard layer freezing, soft regularization, post-hoc weight reversion, and model merging). Clarify the fifth strategy or correct the count.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our evaluation design. We address the major comment below.

  1. Referee: [Evaluation section] Evaluation section (and abstract): The central claim that parameter alignment substantially reduces forgetting rests on the premise that the four evaluation axes sufficiently capture retention of general knowledge for downstream tasks. The manuscript provides no additional evidence or ablation showing that perplexity, reading comprehension, physical reasoning, and translation across the 32+ languages are representative rather than task-specific; this assumption is load-bearing for the generality of the reported reductions in forgetting.

    Authors: We appreciate the referee's point that stronger justification is needed for why these four axes adequately represent retention of general knowledge. Our choice was driven by the need to probe distinct facets of capability retention in a multilingual setting: perplexity for core language modeling, reading comprehension for textual understanding, physical reasoning for factual and inferential knowledge, and translation for cross-lingual transfer. These axes follow standard practice in multilingual LLM evaluation and exhibit consistent patterns across our 32+ languages and five families, lending support to the generality of the forgetting-mitigation results. That said, the current manuscript does not include an explicit discussion or ablation of task representativeness. In revision we will add a concise subsection in the Evaluation section that (a) articulates the rationale for the chosen axes with references to prior multilingual benchmarks, (b) notes their coverage of different capability types, and (c) acknowledges the inherent limits of any finite task suite. This addresses the concern without requiring new experiments.

    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison of alignment strategies on benchmarks


The paper is an empirical study that evaluates five parameter alignment strategies against CPT baselines on perplexity, reading comprehension, physical reasoning, and translation benchmarks across 32 languages. No equations, derivations, or mathematical claims are present that could reduce to fitted quantities or self-definitions by construction. No uniqueness theorems, ansatzes, or load-bearing self-citations are invoked to justify core premises; results are presented as direct experimental outcomes. The work is self-contained against external benchmarks and does not rename known results or smuggle assumptions via citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5710 in / 866 out tokens · 26539 ms · 2026-06-28T21:57:41.144860+00:00 · methodology

