A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAMΔ Integration into Upcycled MoE

Baosong Yang; Hao-Ran Wei; Hao Zhou; Jiajun Chen; Linjuan Wu; Shuaijie She; Shujian Huang; Tianhao Li; Zhijun Wang

arxiv: 2605.18083 · v1 · pith:V3KE6ABMnew · submitted 2026-05-18 · 💻 cs.CL


Pith reviewed 2026-05-20 11:05 UTC · model grok-4.3


keywords: multilingual LLMs · Mixture of Experts · parameter delta · language expansion · post-training · model upcycling · continued pre-training · data-efficient training


The pith

Grafting a MoE-expanded post-training delta onto an upcycled model adds new languages to LLMs while preserving original capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Expanding large language models to new languages demands heavy continued pre-training and alignment data, creating high costs and trade-offs when merging models. This paper proposes upcycling the dense base into a Mixture-of-Experts architecture that assigns separate experts to different languages, followed by grafting a post-training parameter delta to transfer alignment skills directly. A sympathetic reader would care because the approach aims to cut data needs and avoid the dilution of new-language gains or loss of original abilities that plague standard merging. If correct, it offers a practical shortcut for building multilingual LLMs that works across base models and delta types. Experiments show gains on expanded languages against compute-matched baselines while original performance holds steady.

Core claim

The paper claims that upcycling a dense LLM into an MoE with language-specific experts and then grafting an MoE-expanded post-training parameter delta onto the continued pre-training enhanced base model transfers alignment ability, bypassing full alignment retraining and sidestepping the parameter conflicts typical in direct merging.

What carries the argument

The MoE-expanded post-training parameter delta (PARAMΔ) grafted to the CPT-enhanced base after upcycling to allocate experts per language.
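
A minimal NumPy sketch of the grafting step, assuming checkpoints are plain dictionaries of arrays and that expert weights carry a leading expert axis. The delta definition Δ_post = θ_post − θ_base and the uniform broadcast across experts follow the paper passage quoted in the Lean-theorems section of this page. The function names, the parameter names (`attn.w`, `ffn.w`, `router.w`), and the toy shapes are hypothetical and not taken from the paper.

```python
import numpy as np

def post_training_delta(theta_base, theta_post):
    """Delta_post = theta_post - theta_base, computed per parameter tensor."""
    return {name: theta_post[name] - theta_base[name] for name in theta_base}

def graft_delta(moe_cpt, delta, expert_param_names):
    """Add a dense post-training delta to a CPT-enhanced MoE checkpoint.

    Parameters named in expert_param_names are assumed to be stored with a
    leading expert axis (num_experts, ...), so the dense delta is broadcast
    over that axis. Parameters with no dense counterpart (for example the
    router) are copied unchanged.
    """
    grafted = {}
    for name, weight in moe_cpt.items():
        if name not in delta:
            grafted[name] = weight.copy()
        elif name in expert_param_names:
            grafted[name] = weight + delta[name][None, ...]
        else:
            grafted[name] = weight + delta[name]
    return grafted

# Toy checkpoints (hypothetical shapes): one shared matrix, one FFN matrix.
rng = np.random.default_rng(0)
theta_base = {"attn.w": rng.normal(size=(4, 4)), "ffn.w": rng.normal(size=(4, 8))}
theta_post = {name: w + 0.1 for name, w in theta_base.items()}

# CPT-enhanced MoE: 3 experts that have drifted apart, plus a router.
moe_cpt = {
    "attn.w": theta_base["attn.w"] + 1.0,
    "ffn.w": np.stack([theta_base["ffn.w"]] * 3) + rng.normal(size=(3, 4, 8)),
    "router.w": rng.normal(size=(4, 3)),
}

delta = post_training_delta(theta_base, theta_post)
grafted = graft_delta(moe_cpt, delta, expert_param_names={"ffn.w"})
```

In this sketch every expert receives the same shift, while the router, which has no dense counterpart, is left untouched.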

If this is right

Performance on the newly added languages improves over baselines with comparable FLOPs or parameter counts.
Capabilities on the original languages are preserved without significant loss.
The technique applies across multiple base models and different kinds of post-training deltas.
The full alignment phase after continued pre-training can be bypassed.
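
A short sketch of why a comparison can be matched on either parameters or FLOPs but usually not both: in a top-k MoE layer the stored parameter count grows with the number of experts, while the per-token compute grows only with k. The layer sizes below are hypothetical, the two-matrix feed-forward block is a simplification (biases and gated variants are ignored), and none of the numbers come from the paper.

```python
def ffn_params(d_model, d_ff):
    """Parameters in a two-matrix feed-forward block, ignoring biases."""
    return 2 * d_model * d_ff

def moe_layer_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) parameter counts for one MoE feed-forward layer.

    total  : parameters stored in the layer (all experts plus the router)
    active : parameters touched per token (top_k experts plus the router)
    """
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts
    total = num_experts * expert + router
    active = top_k * expert + router
    return total, active

# Hypothetical configuration, chosen only for illustration.
dense = ffn_params(d_model=1024, d_ff=4096)
total, active = moe_layer_params(d_model=1024, d_ff=4096, num_experts=4, top_k=1)
print(f"dense FFN: {dense:,}  MoE total: {total:,}  MoE active: {active:,}")
```

A dense baseline matched on total parameters spends more compute per token than the MoE layer, while one matched on active parameters stores far fewer weights, which is why both kinds of baseline are informative.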

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The expert-per-language split in the upcycled MoE may reduce update interference in other multi-domain or multi-task settings.
Similar grafting could lower barriers for adding specialized capabilities such as code or domain knowledge with limited data.
The method points toward modular expansion strategies that could scale multilingual coverage to more low-resource languages.

Load-bearing premise

Grafting the MoE-expanded post-training delta onto the CPT-enhanced base model will transfer alignment ability without reintroducing parameter conflicts that plague existing merging techniques.

What would settle it

If the grafted model shows clear drops in original-language performance or fails to outperform similar-FLOP baselines on new-language tasks, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2605.18083 by Baosong Yang, Hao-Ran Wei, Hao Zhou, Jiajun Chen, Linjuan Wu, Shuaijie She, Shujian Huang, Tianhao Li, Zhijun Wang.

**Figure 2.** The two-stage DeltaMoE pipeline: 1) CPT via sparse upcycling with a frozen expert to preserve …

**Figure 3.** Average expert selection frequency across …

**Figure 4.** Knowledge retention performance with data …

**Figure 5.** The prompt for flores evaluation

**Figure 6.** The prompt for mmlu evaluation

**Figure 7.** The prompt used for model-based answer extraction

Original abstract

Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice versa. To resolve this conflict, we introduce DeltaMoE, which upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting an MoE-expanded parameter delta (Δ_post) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate DeltaMoE's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show our approach is highly applicable across different models and post-training deltas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical recipe for adding languages via MoE upcycling plus delta grafting that sidesteps full alignment, but the abstract leaves the performance claims under-supported.


The main point is that the authors describe a way to expand an LLM to new languages by first upcycling the dense model into an MoE with language-specific experts, running continued pre-training on the base, and then grafting an expanded post-training parameter delta to add alignment without a separate alignment run. They position this as better than standard merging because the MoE structure reduces the usual conflicts between keeping old capabilities and gaining new ones.
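
The upcycling step described in the letter can be sketched in NumPy as follows. This assumes the common sparse-upcycling recipe of copying the dense feed-forward weights into every expert and adding a freshly initialised router; the function names, the ReLU activation, the top-1 routing without gate scaling, and the toy shapes are illustrative assumptions rather than details confirmed by the paper.

```python
import numpy as np

def dense_ffn(x, w_in, w_out):
    """Plain two-layer feed-forward block with ReLU."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def upcycle(w_in, w_out, num_experts, rng):
    """Copy the dense FFN into every expert and create a new router."""
    experts_in = np.repeat(w_in[None], num_experts, axis=0)    # (E, d, d_ff)
    experts_out = np.repeat(w_out[None], num_experts, axis=0)  # (E, d_ff, d)
    router = rng.normal(scale=0.02, size=(w_in.shape[0], num_experts))
    return experts_in, experts_out, router

def moe_ffn(x, experts_in, experts_out, router):
    """Top-1 routed MoE forward pass (gate scaling omitted for brevity)."""
    choice = (x @ router).argmax(axis=-1)  # expert index per token
    out = np.empty((x.shape[0], experts_out.shape[-1]))
    for e in range(experts_in.shape[0]):
        mask = choice == e
        out[mask] = dense_ffn(x[mask], experts_in[e], experts_out[e])
    return out, choice

# Toy example (hypothetical sizes): 5 tokens, model width 4, hidden width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w_in, w_out = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))

experts_in, experts_out, router = upcycle(w_in, w_out, num_experts=4, rng=rng)
moe_out, choice = moe_ffn(x, experts_in, experts_out, router)
dense_out = dense_ffn(x, w_in, w_out)
```

Because all experts start as copies of the dense block, the upcycled layer reproduces the dense output regardless of how tokens are routed; the experts only diverge once continued pre-training updates them. Keeping one expert frozen, as the Figure 2 caption mentions, would amount to excluding that expert's slice from those updates.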

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DeltaMoE, a method for data-efficient multilingual LLM expansion built on a post-training parameter delta (PARAMΔ). It upcycles a dense model into an MoE architecture with language-specific expert allocation, then grafts an MoE-expanded post-training parameter delta (Δ_post) onto a CPT-enhanced base model to transfer alignment capabilities while bypassing full alignment training. The central claim is that this resolves parameter conflicts in prior merging techniques, yielding superior performance on expanded languages compared to baselines with matched FLOPs or parameter counts, while preserving original capabilities; the approach is also reported to generalize across models and post-training deltas.

Significance. If the empirical results are robustly validated, the method could meaningfully lower the data and compute costs of multilingual LLM development by avoiding extensive alignment for new languages, providing a practical engineering route for efficient language expansion that preserves base-model performance.

major comments (2)

Abstract: the central claim of superiority 'even against baselines with similar FLOPs or number of parameters' is asserted without any reported metrics, number of languages, statistical tests, or explicit baseline-construction details (e.g., how compute was matched or which benchmarks measured new-language gains versus original-capability preservation). This leaves the load-bearing empirical support for the method only weakly evidenced in the manuscript text.
Grafting procedure (method description): the claim that grafting the MoE-expanded Δ_post 'bypasses the complex alignment phase' and avoids the conflicts attributed to standard merging rests on the unstated assumption that language-specific expert allocation cleanly isolates alignment signals. No concrete mechanism is given for preventing cross-expert interference during grafting or inference, which directly risks undermining the asserted preservation of original capabilities.

minor comments (2)

Notation: the title says 'PARAMΔ' while the figure captions say 'DeltaMoE'; a single consistent name and acronym would improve readability.
The abstract does not reference any ablation studies on routing or expert allocation, which would help isolate the contribution of the MoE upcycling step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.


Referee: Abstract: the central claim of superiority 'even against baselines with similar FLOPs or number of parameters' is asserted without any reported metrics, number of languages, statistical tests, or explicit baseline-construction details (e.g., how compute was matched or which benchmarks measured new-language gains versus original-capability preservation). This leaves the load-bearing empirical support for the method only weakly evidenced in the manuscript text.

Authors: We agree that the abstract would be strengthened by greater specificity. The manuscript reports the relevant metrics, the number of languages, baseline construction details (upcycling to match total parameters and FLOPs), and the benchmarks used for new-language gains versus original-capability preservation in the Experiments section. We have revised the abstract to reference the number of languages evaluated and the nature of the matched-FLOPs and matched-parameter baselines more explicitly. revision: yes
Referee: Grafting procedure (method description): the claim that grafting the MoE-expanded Δ_post 'bypasses the complex alignment phase' and avoids the conflicts attributed to standard merging rests on the unstated assumption that language-specific expert allocation cleanly isolates alignment signals. No concrete mechanism is given for preventing cross-expert interference during grafting or inference, which directly risks undermining the asserted preservation of original capabilities.

Authors: The language-specific expert allocation performed during upcycling is the mechanism that enables isolation: the post-training delta is expanded to the same MoE structure so that alignment parameters are grafted only onto the corresponding per-language experts. Routing at inference time then activates the appropriate experts for each language, limiting cross-expert interference. We have expanded the method section with a step-by-step description of the grafting process and an explicit discussion of how expert isolation and routing reduce the parameter conflicts that arise in dense merging. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering method with independent experimental validation


The paper introduces DeltaMoE as a practical technique for multilingual LLM expansion: upcycling a dense model to MoE, then grafting an expanded post-training delta (PARAMΔ) onto a CPT base to transfer alignment without full re-alignment. Claims rest on experimental comparisons (performance on expanded languages, preservation of original capabilities, applicability across models and deltas) rather than any derivation, equation, or prediction that reduces to fitted inputs or self-citations. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the described method or abstract. The central bypass of merging conflicts is framed as an empirical outcome, not a mathematical necessity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on the unstated premise that expert allocation in the upcycled MoE plus delta grafting can simultaneously solve language acquisition and capability preservation; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.0 · 5741 in / 1127 out tokens · 31767 ms · 2026-05-20T11:05:00.817088+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: DeltaMoE upcycles a dense model into MoE, freezes expert 0 as knowledge anchor, grafts MoE-expanded Δ_post = θ_post − θ_base uniformly across experts

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: Router specialization and load-balancing loss L_LB to allocate languages without catastrophic forgetting

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


xcomet: Transparent machine translation evaluation through fine-grained error detection , author=. Transactions of the Association for Computational Linguistics , volume=. 2024 , publisher=

work page 2024

[30] [30]

No Language Left Behind: Scaling Human-Centered Machine Translation

No language left behind: Scaling human-centered machine translation , author=. arXiv preprint arXiv:2207.04672 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

LLaMA Pro: Progressive LLaMA with Block Expansion , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[32] [32]

Higher layers need more lora experts.arXiv preprint arXiv:2402.08562,

Higher layers need more lora experts , author=. arXiv preprint arXiv:2402.08562 , year=

work page arXiv

[33] [33]

Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo. 2022 , url=

work page 2022

[34] [34]

Smith and Pang Wei Koh and Amanpreet Singh and Hannaneh Hajishirzi , booktitle=

Niklas Muennighoff and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Jacob Morrison and Sewon Min and Weijia Shi and Evan Pete Walsh and Oyvind Tafjord and Nathan Lambert and Yuling Gu and Shane Arora and Akshita Bhagia and Dustin Schwenk and David Wadden and Alexander Wettig and Binyuan Hui and Tim Dettmers and Douwe Kiela and Ali Farhadi and Noah A....

work page 2025

[35] [35]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Do llamas work in english? on the latent language of multilingual transformers , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[36] [36]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025

[37] [37]

2023 , eprint=

Skywork: A More Open Bilingual Foundation Model , author=. 2023 , eprint=

work page 2023

[38] [38]

Advances in Neural Information Processing Systems , volume=

Datacomp-lm: In search of the next generation of training sets for language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[39] [39]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Tulu 3: Pushing frontiers in open language model post-training , author=. arXiv preprint arXiv:2411.15124 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

International conference on machine learning , pages=

Glam: Efficient scaling of language models with mixture-of-experts , author=. International conference on machine learning , pages=. 2022 , organization=

work page 2022

[41] [41]

Proceedings of the 36th International Conference on Neural Information Processing Systems , pages=

Training compute-optimal large language models , author=. Proceedings of the 36th International Conference on Neural Information Processing Systems , pages=

work page

[42] [42]

Scaling Laws for Neural Language Models

Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001

[43] [43]

2021 , url=

Dmitry Lepikhin and HyoukJoong Lee and Yuanzhong Xu and Dehao Chen and Orhan Firat and Yanping Huang and Maxim Krikun and Noam Shazeer and Zhifeng Chen , booktitle=. 2021 , url=

work page 2021

[44] [44]

Forty-second International Conference on Machine Learning , year=

Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts , author=. Forty-second International Conference on Machine Learning , year=

work page

[45] [45]

arXiv preprint arXiv:2506.12388 , year=

Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model , author=. arXiv preprint arXiv:2506.12388 , year=

work page arXiv

[46] [46]

arXiv preprint arXiv:2505.22582 , year=

Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts , author=. arXiv preprint arXiv:2505.22582 , year=

work page arXiv

[47] [47]

Forty-first International Conference on Machine Learning , year=

Language models are super mario: Absorbing abilities from homologous models as a free lunch , author=. Forty-first International Conference on Machine Learning , year=

work page

[48] [48]

arXiv preprint arXiv:2502.16770 , year=

Led-merging: Mitigating safety-utility conflicts in model merging with location-election-disjoint , author=. arXiv preprint arXiv:2502.16770 , year=

work page arXiv

[49] [49]

Advances in Neural Information Processing Systems , volume=

Ties-merging: Resolving interference when merging models , author=. Advances in Neural Information Processing Systems , volume=

work page

[50] [50]

arXiv preprint arXiv:2503.20641 , year=

Unlocking efficient long-to-short llm reasoning with model merging , author=. arXiv preprint arXiv:2503.20641 , year=

work page arXiv

[51] [51]

ST-MoE: Designing Stable and Transferable Sparse Expert Models

St-moe: Designing stable and transferable sparse expert models , author=. arXiv preprint arXiv:2202.08906 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[52] [52]

The Thirteenth International Conference on Learning Representations , year=

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts , author=. The Thirteenth International Conference on Learning Representations , year=

work page

[53] [53]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page

[54] [54]

Language models can self-lengthen to generate long texts

Language models can self-lengthen to generate long texts , author=. arXiv preprint arXiv:2410.23933 , year=

work page arXiv

[55] [55]

DeepSeek-AI, D

Aya expanse: Combining research breakthroughs for a new multilingual frontier , author=. arXiv preprint arXiv:2412.04261 , year=

work page arXiv