
arxiv: 2605.18083 · v1 · pith:V3KE6ABMnew · submitted 2026-05-18 · 💻 cs.CL

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAMDelta Integration into Upcycled MoE

Pith reviewed 2026-05-20 11:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual LLMs · Mixture of Experts · parameter delta · language expansion · post-training · model upcycling · continued pre-training · data-efficient training

The pith

Grafting an MoE-expanded post-training delta onto an upcycled model adds new languages to LLMs while preserving original capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Expanding large language models to new languages demands heavy continued pre-training and alignment data, creating high costs and trade-offs when merging models. This paper proposes upcycling the dense base into a Mixture-of-Experts architecture that assigns separate experts to different languages, followed by grafting a post-training parameter delta to transfer alignment skills directly. A sympathetic reader would care because the approach aims to cut data needs and avoid the dilution of new-language gains or loss of original abilities that plague standard merging. If correct, it offers a practical shortcut for building multilingual LLMs that works across base models and delta types. Experiments show gains on expanded languages against compute-matched baselines while original performance holds steady.

Core claim

The paper claims that upcycling a dense LLM into an MoE with language-specific experts and then grafting an MoE-expanded post-training parameter delta onto the continued pre-training enhanced base model transfers alignment ability, bypassing full alignment retraining and sidestepping the parameter conflicts typical in direct merging.

What carries the argument

The MoE-expanded post-training parameter delta (PARAMΔ), grafted onto the CPT-enhanced base model after the dense model has been upcycled into an MoE that allocates experts per language.
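
One way to write the grafting step down, as a sketch under stated assumptions rather than the paper's own notation. Only Δ_post appears in the abstract. The symbols θ_base (dense base weights), θ_inst (its instruct counterpart), θ_CPT^(e) (expert e of the upcycled, CPT-enhanced MoE with E experts), and the choice to replicate the delta across experts are assumptions made here for illustration.

```latex
\begin{align}
\Delta_{\text{post}} &= \theta_{\text{inst}} - \theta_{\text{base}}, \\
\theta^{(e)}_{\text{final}} &= \theta^{(e)}_{\text{CPT}} + \Delta_{\text{post}}, \qquad e = 1, \dots, E .
\end{align}
```

Under this reading, parameters shared across experts (for example attention) would receive the same additive update once. The router has no counterpart in the dense model, so the delta contains no entry for it; how the router is treated is not stated in the text summarized here.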

If this is right

  • Performance on the newly added languages improves over baselines with comparable FLOPs or parameter counts.
  • Capabilities on the original languages are preserved without significant loss.
  • The technique applies across multiple base models and different kinds of post-training deltas.
  • The full alignment phase after continued pre-training can be bypassed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The expert-per-language split in the upcycled MoE may reduce update interference in other multi-domain or multi-task settings.
  • Similar grafting could lower barriers for adding specialized capabilities such as code or domain knowledge with limited data.
  • The method points toward modular expansion strategies that could scale multilingual coverage to more low-resource languages.

Load-bearing premise

Grafting the MoE-expanded post-training delta onto the CPT-enhanced base model will transfer alignment ability without reintroducing parameter conflicts that plague existing merging techniques.

What would settle it

If the grafted model shows clear drops in original-language performance or fails to outperform similar-FLOP baselines on new-language tasks, the central claim would be refuted.
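
The test above can be phrased as a small check. This is a minimal sketch: the function name `claim_refuted`, the score scale (benchmark averages in points), and the `max_drop` tolerance for what counts as a "clear drop" are all hypothetical choices, not values from the paper.

```python
def claim_refuted(orig_before, orig_after, new_grafted, new_baseline,
                  max_drop=1.0):
    """Return True if either failure condition from the text holds.

    orig_before / orig_after: original-language score before and after grafting.
    new_grafted / new_baseline: new-language score of the grafted model and of
    a similar-FLOP baseline.
    max_drop: hypothetical tolerance (in points) for a "clear drop".
    """
    clear_drop = (orig_before - orig_after) > max_drop
    no_gain = new_grafted <= new_baseline
    return clear_drop or no_gain


# Hypothetical scores, for illustration only.
print(claim_refuted(70.0, 69.5, 55.0, 50.0))  # small drop, clear gain
print(claim_refuted(70.0, 65.0, 55.0, 50.0))  # large drop on original languages
```

Any real evaluation would need a tolerance justified by run-to-run variance rather than a fixed constant.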

Figures

Figures reproduced from arXiv: 2605.18083 by Baosong Yang, Hao-Ran Wei, Hao Zhou, Jiajun Chen, Linjuan Wu, Shuaijie She, Shujian Huang, Tianhao Li, Zhijun Wang.

Figure 1: A visualization of the performance trade-off
Figure 2: The two-stage DeltaMoE pipeline: 1) CPT via sparse upcycling with a frozen expert to preserve
Figure 3: Average expert selection frequency across
Figure 4: Knowledge retention performance with data
Figure 5: The prompt for flores evaluation
Figure 6: The prompt for mmlu evaluation knowledge. We report zero-shot accuracy using the chain-of-thought prompt detailed in Section C.2. To maintain representativeness while reducing computational overhead, we evaluate on a stratified subset created by sampling 10% of questions from each subject category. C.2 Multiple-Choice Question Prompting and Extraction For the multiple-choice question (MCQ) benchmarks (M…
Figure 7: The prompt used for model-based answer extraction
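
The Figure 6 text mentions a stratified subset built by sampling 10% of questions from each subject category. A minimal sketch of such a sampler follows. The function name `stratified_subset`, the `subject` field, the fixed seed, the round-up rule, and the toy data are assumptions for illustration; the paper's actual sampling code is not shown in the text.

```python
import math
import random
from collections import defaultdict


def stratified_subset(questions, percent=10, seed=0):
    """Sample `percent` percent of the questions from each subject category."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for q in questions:
        by_subject[q["subject"]].append(q)

    subset = []
    for subject in sorted(by_subject):
        pool = by_subject[subject]
        # Assumption: round up and keep at least one question per subject.
        k = max(1, math.ceil(len(pool) * percent / 100))
        subset.extend(rng.sample(pool, k))
    return subset


# Hypothetical toy data: three subjects with 50, 20, and 5 questions.
questions = [
    {"subject": s, "id": f"{s}-{i}"}
    for s, n in [("algebra", 50), ("law", 20), ("virology", 5)]
    for i in range(n)
]
subset = stratified_subset(questions)
counts = defaultdict(int)
for q in subset:
    counts[q["subject"]] += 1
print(dict(counts))
```

Sampling within each subject keeps small categories represented, which a single uniform 10% draw over all questions would not guarantee.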

Original abstract

Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice-versa. To resolve this conflict, we introduce \method, which upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting a MoE-expanded parameter delta ($\Delta_{\text{post}}$) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate \method's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show our approach is highly applicable across different models and Post-training deltas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PARAMΔ, a method for data-efficient multilingual LLM expansion. It upcycles a dense model into an MoE architecture with language-specific expert allocation, then grafts an MoE-expanded post-training parameter delta (Δ_post) onto a CPT-enhanced base model to transfer alignment capabilities while bypassing full alignment training. The central claim is that this resolves parameter conflicts in prior merging techniques, yielding superior performance on expanded languages compared to baselines with matched FLOPs or parameter counts, while preserving original capabilities; the approach is also shown to generalize across models and post-training deltas.

Significance. If the empirical results are robustly validated, the method could meaningfully lower the data and compute costs of multilingual LLM development by avoiding extensive alignment for new languages, providing a practical engineering route for efficient language expansion that preserves base-model performance.

major comments (2)
  1. Abstract: the central claim of superiority 'even against baselines with similar FLOPs or number of parameters' is asserted without any reported metrics, number of languages, statistical tests, or explicit baseline-construction details (e.g., how compute was matched or which benchmarks measured new-language gains versus original-capability preservation). This leaves the load-bearing empirical support for the method only weakly evidenced in the manuscript text.
  2. Grafting procedure (method description): the claim that grafting the MoE-expanded Δ_post 'bypasses the complex alignment phase' and avoids the conflicts attributed to standard merging rests on the unstated assumption that language-specific expert allocation cleanly isolates alignment signals. No concrete mechanism is given for preventing cross-expert interference during grafting or inference, which directly risks undermining the asserted preservation of original capabilities.
minor comments (2)
  1. Notation: the abstract alternates between 'PARAMΔ' and 'method'; a single consistent name and acronym would improve readability.
  2. The abstract does not reference any ablation studies on routing or expert allocation, which would help isolate the contribution of the MoE upcycling step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: Abstract: the central claim of superiority 'even against baselines with similar FLOPs or number of parameters' is asserted without any reported metrics, number of languages, statistical tests, or explicit baseline-construction details (e.g., how compute was matched or which benchmarks measured new-language gains versus original-capability preservation). This leaves the load-bearing empirical support for the method only weakly evidenced in the manuscript text.

    Authors: We agree that the abstract would be strengthened by greater specificity. The manuscript reports the relevant metrics, the number of languages, baseline construction details (upcycling to match total parameters and FLOPs), and the benchmarks used for new-language gains versus original-capability preservation in the Experiments section. We have revised the abstract to reference the number of languages evaluated and the nature of the matched-FLOPs and matched-parameter baselines more explicitly.
    revision: yes

  2. Referee: Grafting procedure (method description): the claim that grafting the MoE-expanded Δ_post 'bypasses the complex alignment phase' and avoids the conflicts attributed to standard merging rests on the unstated assumption that language-specific expert allocation cleanly isolates alignment signals. No concrete mechanism is given for preventing cross-expert interference during grafting or inference, which directly risks undermining the asserted preservation of original capabilities.

    Authors: The language-specific expert allocation performed during upcycling is the mechanism that enables isolation: the post-training delta is expanded to the same MoE structure so that alignment parameters are grafted only onto the corresponding per-language experts. Routing at inference time then activates the appropriate experts for each language, limiting cross-expert interference. We have expanded the method section with a step-by-step description of the grafting process and an explicit discussion of how expert isolation and routing reduce the parameter conflicts that arise in dense merging.
    revision: yes
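
The grafting steps described in this response can be sketched on toy weights. This is an illustration under assumptions, not the authors' implementation: the helper names `param_delta`, `upcycle`, and `graft`, the flat-list "tensors", and the choice to replicate the FFN delta onto every expert are hypothetical. The frozen expert follows the Figure 2 caption, and the toy also assumes attention stays unchanged during CPT.

```python
import copy


def param_delta(instruct, base):
    """Delta_post = theta_instruct - theta_base, per tensor (flat lists here)."""
    return {name: [i - b for i, b in zip(instruct[name], base[name])]
            for name in base}


def upcycle(dense, num_experts):
    """Sparse upcycling: copy the dense FFN into every expert, share the rest."""
    return {
        "attn": list(dense["attn"]),
        "experts": [list(dense["ffn"]) for _ in range(num_experts)],
    }


def graft(cpt_moe, delta):
    """Add the delta to the CPT-enhanced MoE, replicating the FFN delta per expert."""
    out = copy.deepcopy(cpt_moe)
    out["attn"] = [w + d for w, d in zip(out["attn"], delta["attn"])]
    out["experts"] = [[w + d for w, d in zip(expert, delta["ffn"])]
                      for expert in out["experts"]]
    return out


# Hypothetical toy weights for a dense base model and its instruct counterpart.
base = {"attn": [1.0, 2.0], "ffn": [0.5, -0.5]}
instruct = {"attn": [1.5, 2.0], "ffn": [0.5, 0.25]}
delta = param_delta(instruct, base)

# Upcycle to two experts; simulate CPT by changing only expert 1,
# while expert 0 stays frozen at the dense FFN weights.
cpt_moe = upcycle(base, num_experts=2)
cpt_moe["experts"][1] = [0.75, -1.0]

grafted = graft(cpt_moe, delta)
print(grafted["experts"][0] == instruct["ffn"])  # frozen expert matches the instruct FFN
```

In this toy, the frozen expert plus the delta reproduces the instruct model's FFN exactly, while the CPT-trained expert keeps its new-language update and gains the same additive shift. Whether that shift remains appropriate for an expert that has moved far from the base weights is the interference question the referee raises.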

Circularity Check

0 steps flagged

No circularity: empirical engineering method with independent experimental validation

full rationale

The paper introduces PARAMΔ as a practical technique for multilingual LLM expansion: upcycling a dense model to MoE, then grafting an expanded post-training delta onto a CPT base to transfer alignment without full re-alignment. Claims rest on experimental comparisons (performance on expanded languages, preservation of original capabilities, applicability across models/deltas) rather than any derivation, equation, or prediction that reduces to fitted inputs or self-citations. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the described method or abstract. The central bypass of merging conflicts is framed as an empirical outcome, not a mathematical necessity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on the unstated premise that expert allocation in the upcycled MoE plus delta grafting can simultaneously solve language acquisition and capability preservation; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

