Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation

Aditi Khandelwal; Golnoosh Farnadi; Marius Mosbach; Siva Reddy; Verna Dankers

arxiv: 2605.29714 · v1 · pith:VP5AKV6Wnew · submitted 2026-05-28 · 💻 cs.CL


Pith reviewed 2026-06-29 07:30 UTC · model grok-4.3


keywords mixture-of-experts · multilingual adaptation · routing dynamics · parameter-efficient adaptation · continual pre-training · language specialization · final layers


The pith

Language specialization in MoE models concentrates in final layers, allowing adaptation by updating under 2 percent of parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks how an English-centric Mixture-of-Experts model routes tokens when continually pre-trained on a multilingual corpus. Early and middle layers develop language-agnostic routing while the final layers acquire language-specific experts, with vocabulary overlap between languages shaping the sharing pattern. From this observation the authors derive an adaptation method that selectively updates experts only in those final layers. On MultiBLiMP and Belebele the resulting models reach performance comparable to full final-layer fine-tuning yet modify fewer than 2 percent of total parameters. The work therefore supplies both an empirical picture of where multilingual specialization appears and a practical route to low-resource adaptation.

Core claim

Continual multilingual pre-training produces diffused, language-agnostic routing in early and middle MoE layers, with language specialization emerging primarily in the final layers; token-level vocabulary overlap influences routing decisions, so selectively updating language-specific and shared experts in the final layers yields competitive multilingual performance while changing less than 2 percent of parameters.
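The "diffused" versus "specialized" contrast in the claim above can be made concrete with an entropy measure over expert usage. The following is a minimal sketch, assuming top-1 routing traces and normalized Shannon entropy; the function name `routing_entropy` and the toy traces are illustrative assumptions, not the paper's code or data.

```python
import math
from collections import Counter


def routing_entropy(expert_ids, num_experts):
    """Normalized Shannon entropy of expert usage in one layer.

    Returns 0.0 when every token goes to a single expert and 1.0 when
    tokens are spread uniformly over all experts.
    """
    counts = Counter(expert_ids)
    total = len(expert_ids)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    # Divide by the maximum possible entropy, log(num_experts).
    return entropy / math.log(num_experts)


# Hypothetical traces: the expert chosen for each token in one layer.
diffuse_layer = [0, 1, 2, 3, 0, 1, 2, 3]      # usage spread evenly
specialized_layer = [2, 2, 2, 2, 2, 2, 2, 3]  # usage concentrated on expert 2

print(routing_entropy(diffuse_layer, num_experts=4))
print(routing_entropy(specialized_layer, num_experts=4))
```

Under this reading, a layer with language-agnostic routing would score near 1 for every language, while a specialized final layer would score noticeably lower per language.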

What carries the argument

Final-layer expert routing, the locus where language specialization concentrates and can be targeted for parameter-efficient updates.
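Targeting only final-layer experts amounts to a filter over parameter names. The sketch below is one hedged way to express that filter; the naming pattern `layers.<i>.mlp.experts.<j>.` and the chosen layer and expert indices are hypothetical, and how the paper actually selects experts is not specified on this page.

```python
import re

# Assumed naming scheme for expert weights; real checkpoints may differ.
EXPERT_PATTERN = re.compile(r"layers\.(\d+)\.mlp\.experts\.(\d+)\.")


def is_trainable(param_name, final_layers, selected_experts):
    """Return True only for selected experts inside the chosen final layers."""
    match = EXPERT_PATTERN.search(param_name)
    if match is None:
        return False  # attention, router, embeddings, norms stay frozen
    layer, expert = int(match.group(1)), int(match.group(2))
    return layer in final_layers and expert in selected_experts.get(layer, set())


# Hypothetical selection: two final layers, a few experts each.
final_layers = {14, 15}
selected_experts = {14: {5}, 15: {3, 7}}

param_names = [
    "model.layers.2.mlp.experts.3.up_proj.weight",
    "model.layers.15.mlp.experts.3.up_proj.weight",
    "model.layers.15.mlp.experts.9.up_proj.weight",
    "model.layers.15.self_attn.q_proj.weight",
]
for name in param_names:
    print(name, is_trainable(name, final_layers, selected_experts))
```

In a training framework, the boolean would typically be used to switch gradient tracking on or off for each parameter.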

If this is right

Vocabulary overlap between languages directly modulates the degree of expert sharing in final layers.
Multilingual adaptation can be performed by touching only the last MoE blocks without retraining earlier layers.
The same final-layer focus supplies a concrete recipe for low-resource language extension of large MoE models.
Routing analysis during continual pre-training can locate the minimal set of parameters needed for language specialization.
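The degree of expert sharing between two languages can be compared with a Jensen-Shannon divergence (JSD) over their expert-usage distributions, the quantity the figure captions refer to. This is a minimal sketch assuming top-1 routing traces and base-2 logarithms (so values fall in [0, 1]); the helper names and the toy traces are assumptions for illustration only.

```python
import math


def usage_distribution(expert_ids, num_experts):
    """Fraction of tokens routed to each expert."""
    counts = [0] * num_experts
    for expert in expert_ids:
        counts[expert] += 1
    total = len(expert_ids)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: 0 for identical usage, 1 for disjoint usage."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical final-layer traces for three languages (4 experts).
lang_a = usage_distribution([0, 0, 0, 1, 1, 1, 2, 2], 4)
lang_b = usage_distribution([0, 0, 0, 1, 1, 2, 2, 2], 4)
lang_c = usage_distribution([3, 3, 3, 2, 3, 3, 2, 3], 4)

print(jsd(lang_a, lang_b))  # similar usage, low divergence
print(jsd(lang_a, lang_c))  # mostly different experts, higher divergence
```

A pairwise matrix of these values per layer and per checkpoint is one way to see where sharing breaks down during continual pre-training.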

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layer-wise specialization pattern may appear when MoE models are adapted to domains other than language.
If early-layer routing remains language-agnostic across many settings, future MoE designs could freeze those layers by default.
Measuring vocabulary overlap before adaptation could predict how many final-layer experts need language-specific copies.
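Measuring vocabulary overlap before adaptation could look like the sketch below. It assumes overlap is the Jaccard index over the sets of token types observed in each language's tokenized corpus; the paper's exact definition of token-vocabulary overlap is not given on this page, and the token lists here are made up.

```python
def vocab_overlap(tokens_a, tokens_b):
    """Jaccard overlap between the token types seen in two corpora."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty corpora share nothing measurable
    return len(set_a & set_b) / len(union)


# Hypothetical subword tokens produced by a shared tokenizer.
tokens_lang_a = ["la", "casa", "es", "gran", "de"]
tokens_lang_b = ["la", "casa", "és", "gran"]
tokens_lang_c = ["یہ", "گھر", "بڑا", "ہے"]

print(vocab_overlap(tokens_lang_a, tokens_lang_b))
print(vocab_overlap(tokens_lang_a, tokens_lang_c))
```

Pairs with high overlap would then be candidates for shared experts, and pairs with low overlap for language-specific ones.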

Load-bearing premise

The routing patterns and final-layer specialization seen in this English-centric model and corpus will appear in other MoE architectures and language collections.

What would settle it

A replication on a different MoE architecture or language set in which final-layer-only updates fail to match the performance of full final-layer fine-tuning at comparable parameter budgets.

Figures

Figures reproduced from arXiv: 2605.29714 by Aditi Khandelwal, Golnoosh Farnadi, Marius Mosbach, Siva Reddy, Verna Dankers.

**Figure 2.** Comparison of routing entropy across layers.

**Figure 4.** Cross-lingual Routing Divergence in the Final Layer using Pairwise JSD for OLMoE-Base (left) and OLMoE-M7 (right). Darker blue indicates higher expert sharing. Bolded languages are high-resource; italicized languages are low-resource.

[Adjacent plot, caption not recovered: JSD (y-axis) by layer (x-axis) for checkpoints Base, Step-800, Step-1500, Step-2450, and OLMoE-M7.]

**Figure 6.** JSD vs Token-Vocabulary Overlap between language pairs. Each point is a language pair and the outlined points with black edges indicate a few qualitative examples for high and low-resource language pairs. In these layers, routing similarity aligns closely with token-level vocabulary overlap, which can sometimes supersede typological markers in driving the router’s statistical behavior (e.g., the high rout…

**Figure 7.** Illustration of the “activation gap” procedure.

**Figure 8.** Average MultiBLiMP performance across target languages comparing different adaptation strategies (SEFT, SSFT) against baselines. Numbers in blue next to each bar indicate trainable parameters.

**Figure 9.** Token-Vocabulary Overlap across language

**Figure 10.** Comparison of routing entropy across layers for OLMoE-Base (left) and OLMoE-M7 (right) across all

**Figure 11.** Cross-lingual Routing Divergence in Layer 13 using Pairwise Jensen-Shannon Divergence (JSD) for OLMoE-Base (left) and OLMoE-M7 (right). Darker blue indicates higher expert sharing.

[Adjacent table, caption not recovered:]

| Lang. | k=0 | k=1 | k=3 | k=5 | M7 |
|---|---|---|---|---|---|
| Cat. | 82.9 | 83.2 | 84.2 | 85.4 | 81.3 |
| Est. | 68.4 | 71.5 | 72.9 | 75.3 | 61.5 |
| Mar. | 66.5 | 67.0 | 72.4 | 73.7 | 68.5 |
| Slk. | 86.3 | 86.5 | 88.0 | 88.6 | 86.0 |
| Ukr. | 81.7 | 79.3 | 78.1 | 85.1 | 80.7 |
| Urd. | 86.5 | 84.7 | 86.4 | 93.3 | 83.3 |

**Figure 12.** Cross-lingual Routing Divergence in Layer 14 using Pairwise Jensen-Shannon Divergence (JSD) for OLMoE-Base (left) and OLMoE-M7 (right). Darker blue indicates higher expert sharing.

Original abstract

Mixture-of-Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre-training of an English-centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre-training leads to diffused, language-agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token-level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter-efficient adaptation strategy that updates language-specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance-efficiency trade-off, attaining competitive performance relative to fine-tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre-training and provide practical insights for low-resource multilingual adaptation. Our code is available at https://github.com/aditi184/moe-routing-adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They track how routing diffuses in an English-centric MoE during multilingual continual pre-training, observe final-layer specialization plus vocabulary-overlap effects, and turn that into a final-layer-only adaptation method that reaches competitive scores on MultiBLiMP and Belebele while touching under 2% of parameters.


The core contribution is the layer-wise routing measurements during continual pre-training of an English-centric MoE on multilingual data. Early and middle layers lose language-specific routing and become more diffuse, while specialization concentrates in the final layers; they also note that vocabulary overlap between languages influences routing patterns. From those observations they derive a simple adaptation approach that only updates language-specific and shared experts in the final MoE layers.

That measurement-plus-method package is the genuinely new piece. Prior MoE adaptation work has not zeroed in on this multilingual continual-pre-training regime or produced these particular routing diagnostics. Releasing the code is a plus and makes the empirical claims checkable.

The main soft spot is that the abstract gives almost no experimental detail: no mention of statistical significance, exact baseline configurations, hyper-parameter choices, or how they handled potential confounds in the continual pre-training setup. The reported performance-efficiency trade-off therefore rests on results whose robustness is hard to assess from what is shown. The generalization claim—that the observed final-layer specialization will hold for other MoE architectures and language sets—is also untested here and could be fragile.

This is the kind of paper that belongs in a reading group for people working on efficient multilingual adaptation of large MoEs. It is not reshaping the field, but the routing analysis is concrete enough and the method is lightweight enough that a serious referee could usefully pressure-test the numbers and the scope. I would send it out for review rather than desk-reject, mainly to get the experimental details clarified.

Referee Report

1 major / 2 minor

Summary. The paper studies routing dynamics in an English-centric MoE model during continual pre-training on multilingual data. It reports that early and middle layers exhibit diffused, language-agnostic routing while language specialization concentrates in the final layers, modulated by token-level vocabulary overlap. Motivated by these observations, the authors propose a parameter-efficient adaptation approach that selectively updates language-specific and shared experts only in the final MoE layers. Experiments on MultiBLiMP and Belebele are presented as evidence that the method attains competitive performance relative to full final-layer fine-tuning while updating fewer than 2% of parameters. Code is released at the provided GitHub link.

Significance. If the experimental claims hold, the work supplies concrete empirical observations on where language specialization emerges in MoE routing and demonstrates a practical, low-parameter adaptation recipe for multilingual settings. The public release of code is a clear strength that aids reproducibility and follow-up work.

major comments (1)

[Experimental evaluation] Experimental evaluation (results on MultiBLiMP and Belebele): the central efficiency claim—that competitive performance is achieved while updating <2% of parameters—is only moderately supported because the manuscript provides no details on statistical significance testing, run-to-run variance, exact baseline configurations, hyper-parameter choices, or potential confounds arising from the continual pre-training protocol.

minor comments (2)

[Abstract] The abstract introduces the adaptation strategy only after the routing findings; a one-sentence description of the final-layer expert update rule would improve immediate readability.
[Method] The manuscript would benefit from an explicit statement of the precise fraction of parameters updated (e.g., number of experts and layers involved) rather than the aggregate “less than 2%” figure.
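The kind of explicit accounting requested in the second minor comment can be sketched as simple arithmetic. The configuration below is entirely illustrative: the hidden size, expert width, total parameter count, and the number of layers and experts updated are assumptions, and the three-matrix gated feed-forward expert is assumed as well; none of these figures come from the paper.

```python
def expert_param_count(hidden_size, intermediate_size):
    """Parameters in one gated feed-forward expert (gate, up, down; no biases)."""
    return 3 * hidden_size * intermediate_size


def updated_fraction(total_params, layers_updated, experts_per_layer,
                     hidden_size, intermediate_size):
    """Share of all model parameters touched by a selective expert update."""
    per_expert = expert_param_count(hidden_size, intermediate_size)
    updated = layers_updated * experts_per_layer * per_expert
    return updated / total_params


# Illustrative numbers only, not the paper's configuration.
fraction = updated_fraction(
    total_params=7_000_000_000,
    layers_updated=2,
    experts_per_layer=8,
    hidden_size=2048,
    intermediate_size=1024,
)
print(f"{fraction:.2%} of parameters updated")
```

With these made-up values the fraction comes out to roughly 1.4 percent. Reporting the inputs alongside the aggregate figure is what would let a reader check the "less than 2%" claim directly.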

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive summary, the significance assessment, and the recommendation for minor revision. We address the single major comment below and will strengthen the experimental reporting accordingly.


Referee: [Experimental evaluation] Experimental evaluation (results on MultiBLiMP and Belebele): the central efficiency claim—that competitive performance is achieved while updating <2% of parameters—is only moderately supported because the manuscript provides no details on statistical significance testing, run-to-run variance, exact baseline configurations, hyper-parameter choices, or potential confounds arising from the continual pre-training protocol.

Authors: We agree that the current manuscript omits these details, which limits the strength of the efficiency claims. In the revision we will add a new subsection to the Experimental Setup that (1) reports mean and standard deviation over three random seeds for all MultiBLiMP and Belebele scores, (2) includes paired t-test p-values for the key comparisons against full final-layer fine-tuning, (3) provides a table of all hyper-parameters and exact baseline configurations, and (4) explicitly discusses the continual pre-training protocol and any controls used to mitigate confounds. These additions will be placed before the results tables so readers can directly assess the reported performance-efficiency trade-off.

revision: yes
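The seed-level reporting promised in the response could be computed along these lines. This is a minimal standard-library sketch: the scores are invented placeholders, the function name is an assumption, and only the paired t statistic is computed here (a p-value would additionally need the Student's t distribution with n - 1 degrees of freedom, for example via a statistics package).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two methods evaluated on the same seeds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation, needs n >= 2
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd_diff / math.sqrt(len(diffs)))


# Placeholder accuracies over three seeds; not results from the paper.
selective_update = [80.1, 79.6, 80.4]
full_final_layer = [80.5, 80.2, 80.3]

for name, scores in [("selective", selective_update), ("full", full_final_layer)]:
    print(name, round(statistics.mean(scores), 2), round(statistics.stdev(scores), 2))

t_stat = paired_t_statistic(selective_update, full_final_layer)
print("paired t statistic:", round(t_stat, 3))
```

With only three seeds the test has two degrees of freedom, so even sizeable gaps may not reach conventional significance thresholds; that limitation is worth stating next to any reported p-value.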

Circularity Check

0 steps flagged

No significant circularity identified


The paper performs empirical analysis of routing dynamics in an MoE model during continual pre-training, observes language specialization patterns in final layers, and then proposes and evaluates a parameter-efficient adaptation method motivated by those observations. All central claims (routing patterns, vocabulary overlap effects, performance-efficiency trade-offs on MultiBLiMP/Belebele) rest on direct experimental measurements rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study; its central claim rests on standard assumptions about benchmark validity and training dynamics rather than new mathematical axioms or invented entities.

axioms (1)

Domain assumption: MultiBLiMP and Belebele are valid proxies for measuring cross-lingual language understanding after adaptation. The performance-efficiency claim is evaluated exclusively on these two benchmarks.



