A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAMDelta Integration into Upcycled MoE
Pith reviewed 2026-05-20 11:05 UTC · model grok-4.3
The pith
Grafting a MoE-expanded post-training delta onto an upcycled model adds new languages to LLMs while preserving original capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that upcycling a dense LLM into an MoE with language-specific experts, then grafting an MoE-expanded post-training parameter delta onto the base model enhanced by continued pre-training (CPT), transfers alignment ability. This bypasses full alignment retraining and sidesteps the parameter conflicts typical of direct merging.
What carries the argument
The MoE-expanded post-training parameter delta (PARAMΔ), grafted onto the CPT-enhanced base model after the dense model has been upcycled into an MoE that allocates experts per language.
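The grafting step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it relies only on the statement elsewhere on this page that Δpost = θpost − θbase is expanded to the MoE structure and grafted uniformly across experts. The parameter naming scheme (`ffn.` prefix, `expertN.` keys), the flattened-list weight format, and the choice to leave the router untouched are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical format: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def post_training_delta(theta_post: Params, theta_base: Params) -> Params:
    """Delta_post = theta_post - theta_base, computed per parameter."""
    return {
        name: [p - b for p, b in zip(theta_post[name], theta_base[name])]
        for name in theta_base
    }


def expand_to_moe(dense: Params, num_experts: int, ffn_prefix: str = "ffn.") -> Params:
    """Copy each FFN parameter once per expert; keep shared parameters as they are.

    The "ffn." prefix and "expertN." key layout are assumptions for this sketch.
    """
    moe: Params = {}
    for name, values in dense.items():
        if name.startswith(ffn_prefix):
            for e in range(num_experts):
                moe[f"expert{e}.{name}"] = list(values)
        else:
            moe[name] = list(values)
    return moe


def graft(theta_cpt_moe: Params, delta_moe: Params) -> Params:
    """Add the MoE-expanded delta to the CPT-enhanced MoE weights.

    Parameters with no matching delta (here, the router) are left unchanged;
    that choice is an assumption, not something this page specifies.
    """
    return {
        name: [w + d for w, d in zip(values, delta_moe.get(name, [0.0] * len(values)))]
        for name, values in theta_cpt_moe.items()
    }
```

With two experts, the same dense FFN delta is added to both `expert0.ffn.w` and `expert1.ffn.w`, while each expert keeps whatever it learned during continued pre-training.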
If this is right
- Performance on the newly added languages improves over baselines with comparable FLOPs or parameter counts.
- Capabilities on the original languages are preserved without significant loss.
- The technique applies across multiple base models and different kinds of post-training deltas.
- The full alignment phase after continued pre-training can be bypassed.
Where Pith is reading between the lines
- The expert-per-language split in the upcycled MoE may reduce update interference in other multi-domain or multi-task settings.
- Similar grafting could lower barriers for adding specialized capabilities such as code or domain knowledge with limited data.
- The method points toward modular expansion strategies that could scale multilingual coverage to more low-resource languages.
Load-bearing premise
Grafting the MoE-expanded post-training delta onto the CPT-enhanced base model will transfer alignment ability without reintroducing parameter conflicts that plague existing merging techniques.
What would settle it
If the grafted model shows clear drops in original-language performance or fails to outperform similar-FLOP baselines on new-language tasks, the central claim would be refuted.
Original abstract
Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice-versa. To resolve this conflict, we introduce \method, which upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting a MoE-expanded parameter delta ($\Delta_{\text{post}}$) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate \method's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show our approach is highly applicable across different models and post-training deltas.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PARAMΔ, a method for data-efficient multilingual LLM expansion. It upcycles a dense model into an MoE architecture with language-specific expert allocation, then grafts an MoE-expanded post-training parameter delta (Δ_post) onto a CPT-enhanced base model to transfer alignment capabilities while bypassing full alignment training. The central claim is that this resolves parameter conflicts in prior merging techniques, yielding superior performance on expanded languages compared to baselines with matched FLOPs or parameter counts, while preserving original capabilities; the approach is also shown to generalize across models and post-training deltas.
Significance. If the empirical results are robustly validated, the method could meaningfully lower the data and compute costs of multilingual LLM development by avoiding extensive alignment for new languages, providing a practical engineering route for efficient language expansion that preserves base-model performance.
major comments (2)
- Abstract: the central claim of superiority 'even against baselines with similar FLOPs or number of parameters' is asserted without any reported metrics, number of languages, statistical tests, or explicit baseline-construction details (e.g., how compute was matched or which benchmarks measured new-language gains versus original-capability preservation). This leaves the load-bearing empirical support for the method only weakly evidenced in the manuscript text.
- Grafting procedure (method description): the claim that grafting the MoE-expanded Δ_post 'bypasses the complex alignment phase' and avoids the conflicts attributed to standard merging rests on the unstated assumption that language-specific expert allocation cleanly isolates alignment signals. No concrete mechanism is given for preventing cross-expert interference during grafting or inference, which directly risks undermining the asserted preservation of original capabilities.
minor comments (2)
- Notation: the abstract alternates between 'PARAMΔ' and 'method'; a single consistent name and acronym would improve readability.
- The abstract does not reference any ablation studies on routing or expert allocation, which would help isolate the contribution of the MoE upcycling step.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: Abstract: the central claim of superiority 'even against baselines with similar FLOPs or number of parameters' is asserted without any reported metrics, number of languages, statistical tests, or explicit baseline-construction details (e.g., how compute was matched or which benchmarks measured new-language gains versus original-capability preservation). This leaves the load-bearing empirical support for the method only weakly evidenced in the manuscript text.
  Authors: We agree that the abstract would be strengthened by greater specificity. The manuscript reports the relevant metrics, the number of languages, baseline construction details (upcycling to match total parameters and FLOPs), and the benchmarks used for new-language gains versus original-capability preservation in the Experiments section. We have revised the abstract to reference the number of languages evaluated and the nature of the matched-FLOPs and matched-parameter baselines more explicitly.
  revision: yes
- Referee: Grafting procedure (method description): the claim that grafting the MoE-expanded Δ_post 'bypasses the complex alignment phase' and avoids the conflicts attributed to standard merging rests on the unstated assumption that language-specific expert allocation cleanly isolates alignment signals. No concrete mechanism is given for preventing cross-expert interference during grafting or inference, which directly risks undermining the asserted preservation of original capabilities.
  Authors: The language-specific expert allocation performed during upcycling is the mechanism that enables isolation: the post-training delta is expanded to the same MoE structure so that alignment parameters are grafted only onto the corresponding per-language experts. Routing at inference time then activates the appropriate experts for each language, limiting cross-expert interference. We have expanded the method section with a step-by-step description of the grafting process and an explicit discussion of how expert isolation and routing reduce the parameter conflicts that arise in dense merging.
  revision: yes
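The routing behavior described in the second response can be illustrated with a small sketch. The page does not say which router the paper uses, so this assumes the common top-k softmax gate: the k experts with the largest router logits are selected and their outputs are mixed with softmax weights computed over the selected logits only. The function names and the scalar "experts" are hypothetical stand-ins for real FFN blocks.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_route(logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, gate_weight) pairs for the k largest logits.

    Gate weights are a softmax restricted to the selected experts,
    so they sum to 1 and unselected experts contribute nothing.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(
    x: float,
    logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 1,
) -> float:
    """Mix the outputs of the selected experts by their gate weights."""
    return sum(gate * experts[i](x) for i, gate in top_k_route(logits, k))
```

Under this assumption, a token whose router logits strongly favor one expert is processed by that expert alone when k = 1. Whether logits separate cleanly by language in practice is exactly the empirical question the referee raises.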
Circularity Check
No circularity: empirical engineering method with independent experimental validation
full rationale
The paper introduces PARAMΔ as a practical technique for multilingual LLM expansion: upcycling a dense model to MoE, then grafting an expanded post-training delta onto a CPT base to transfer alignment without full re-alignment. Claims rest on experimental comparisons (performance on expanded languages, preservation of original capabilities, applicability across models/deltas) rather than any derivation, equation, or prediction that reduces to fitted inputs or self-citations. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the described method or abstract. The central bypass of merging conflicts is framed as an empirical outcome, not a mathematical necessity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: DeltaMoE upcycles a dense model into MoE, freezes expert 0 as knowledge anchor, grafts MoE-expanded Δpost = θpost − θbase uniformly across experts
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Router specialization and load-balancing loss L_LB to allocate languages without catastrophic forgetting
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
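The load-balancing loss named in the second linked passage is not spelled out on this page. As a hedged illustration only, the sketch below implements one widely used auxiliary loss for top-1 routing, in the style popularized by Switch Transformer: the number of experts times the sum, over experts, of the fraction of tokens dispatched to each expert multiplied by the mean router probability for that expert. Whether the paper uses this exact form is an assumption; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits: Sequence[Sequence[float]]) -> float:
    """Auxiliary loss N * sum_i f_i * p_i for top-1 routing.

    f_i: fraction of tokens whose highest-probability expert is i.
    p_i: mean router probability assigned to expert i across tokens.
    The value is close to 1 when routing is balanced and grows as
    tokens concentrate on fewer experts.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    counts = [0] * num_experts
    for row in probs:
        counts[max(range(num_experts), key=lambda i: row[i])] += 1
    f = [c / num_tokens for c in counts]

    p = [sum(row[i] for row in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))
```

A loss of this shape penalizes a router that sends every token to one expert, which is in tension with deliberately steering each language to its own expert; how the paper balances the two is not stated here.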