Do Domain-specific Experts exist in MoE-based LLMs?

Giang Do; Hung Le; Truyen Tran

arxiv: 2604.05267 · v1 · submitted 2026-04-07 · 💻 cs.CL

Do Domain-specific Experts exist in MoE-based LLMs?

Giang Do , Hung Le , Truyen Tran This is my paper

Pith reviewed 2026-05-10 20:14 UTC · model grok-4.3

classification 💻 cs.CL

keywords Mixture of Expertsdomain-specific expertslarge language modelsexpert routingtraining-free adaptationMoE interpretability

0 comments

The pith

Mixture-of-Experts LLMs contain domain-specific experts that steering can activate to raise performance without retraining or added cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors test whether MoE-based large language models develop experts that focus on particular domains such as mathematics, coding, or general knowledge. They run experiments on ten models spanning 3.8 billion to 120 billion parameters and report consistent patterns in expert activation that align with domain boundaries. From this observation they build DSMoE, a training-free method that re-routes inputs toward the relevant experts at inference time and records higher accuracy than both the original models and supervised fine-tuning on target domains while preserving results on unrelated tasks.

Core claim

Empirical analysis of ten MoE LLMs reveals the presence of domain-specific experts. DSMoE, a training-free framework, identifies these experts from routing behavior and steers subsequent inferences toward them, delivering stronger results than the base MoE models and SFT baselines across four open-source architectures on both target and non-target domains while incurring zero extra inference cost and requiring no retraining.

What carries the argument

Domain Steering Mixture of Experts (DSMoE), which locates domain-specialized experts via observed activation patterns and adjusts the router to favor those experts for inputs from the corresponding domain.

If this is right

Accuracy rises on tasks that match the identified domains while performance on other domains stays stable or improves.
The same steering works on models of widely different sizes without any parameter updates.
No increase in inference-time computation or memory occurs because only the routing decisions change.
The gains appear consistently across multiple open-source MoE families rather than being tied to one training recipe.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If expert-domain alignments prove stable across different training seeds, they could be reused as modular components for editing or composing new models.
Routing-based diagnostics might be applied to other sparse architectures to surface latent specializations that are not obvious from parameter inspection.
The approach suggests that lightweight inference-time interventions can sometimes substitute for full fine-tuning when the underlying model already contains the needed expertise.

Load-bearing premise

The procedure that labels certain experts as domain-specific is correctly identifying genuine specialization instead of transient or incidental activation patterns.

What would settle it

Apply the DSMoE steering procedure to an additional MoE model and measure no accuracy improvement on domain-specific benchmarks, or show that steering toward randomly chosen experts produces equivalent gains.

Figures

Figures reproduced from arXiv: 2604.05267 by Giang Do, Hung Le, Truyen Tran.

**Figure 2.** Figure 2: Token ranking scores across five representative samples from the MMLU-Pro, mathematics domain for [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: Domain-specific expert scores for GPT-OSS-20B on the Mathematics domain. Higher magnitudes indicate stronger domain specialization. Best viewed in color. based LLMs. All models and datasets used in this work are publicly available on Hugging Face, ensuring full reproducibility of our results. SFT Baseline. We implement the SFT baseline using the open-source PEFT library (Mangrulkar et al., 2022) with th… view at source ↗

**Figure 4.** Figure 4: Domain-specific expert scores for GPT-OSS-120B on the Mathematics domain. Higher magnitudes indicate stronger domain specialization. Best viewed in color [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Domain-specific expert scores for Qwen3-30B-Instruct on the Mathematics domain. Higher magnitudes indicate stronger domain specialization. Best viewed in color. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Domain-specific expert scores for Qwen3-30B-Thinking on the Mathematics domain. Higher magnitudes indicate stronger domain specialization. Best viewed in color. Model Params Active Experts Top-K Layers HuggingFace Model ID (B) (B) (N) PhiMoE-Tiny 3.8 1.1 16 2 32 microsoft/Phi-tiny-MoE-instruct OLMoE 7.0 1.0 64 8 16 allenai/OLMoE-1B-7B-0924 Qwen1.5-MoE 14.3 2.7 60 4 24 Qwen/Qwen1.5-MoE-A2.7B DeepSeek-MoE 16… view at source ↗

read the original abstract

In the era of Large Language Models (LLMs), the Mixture of Experts (MoE) architecture has emerged as an effective approach for training extremely large models with improved computational efficiency. This success builds upon extensive prior research aimed at enhancing expert specialization in MoE-based LLMs. However, the nature of such specializations and how they can be systematically interpreted remain open research challenges. In this work, we investigate this gap by posing a fundamental question: \textit{Do domain-specific experts exist in MoE-based LLMs?} To answer the question, we evaluate ten advanced MoE-based LLMs ranging from 3.8B to 120B parameters and provide empirical evidence for the existence of domain-specific experts. Building on this finding, we propose \textbf{Domain Steering Mixture of Experts (DSMoE)}, a training-free framework that introduces zero additional inference cost and outperforms both well-trained MoE-based LLMs and strong baselines, including Supervised Fine-Tuning (SFT). Experiments on four advanced open-source MoE-based LLMs across both target and non-target domains demonstrate that our method achieves strong performance and robust generalization without increasing inference cost or requiring additional retraining. Our implementation is publicly available at https://github.com/giangdip2410/Domain-specific-Experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports activation patterns suggesting domain-specific experts across ten MoE LLMs and introduces a zero-cost DSMoE steering method that claims gains over base models and SFT.

read the letter

The two things to know are that the authors checked ten MoE models from 3.8B to 120B parameters and saw experts activating more on certain domains, then built DSMoE to steer those experts at inference time for better target-domain results without retraining or extra cost. They also test generalization to non-target domains on four models and release the code. What is new is the broad empirical scan plus the DSMoE framework itself, which is not in the cited prior work. The multi-model coverage and the practical no-cost angle are the parts that hold up best; running the same check on models of very different scales gives the observation some weight, and the public implementation lets others verify the numbers directly. The soft spots are in the identification step and the causal claim. Labeling experts as domain-specific via activation frequency risks picking up training-data skew rather than genuine capability differences. The stress-test concern lands because the abstract and available details give no ablations against steering random high-activation experts or using shuffled-domain controls, so it is unclear whether the reported gains come specifically from the labeled experts or from any router tweak. Metrics for domain specificity and exact experimental controls are also missing from the summary, leaving the central evidence more observational than isolated. This paper is for people working on MoE inference efficiency and domain adaptation who want simple, training-free adjustments. A reader focused on practical deployment would find the method and the cross-model results useful to examine. It deserves a serious referee because the experiments span enough models and the idea is concrete enough to repay detailed review, even if the interpretation section will need tightening on controls.

Referee Report

3 major / 2 minor

Summary. The paper evaluates ten MoE-based LLMs (3.8B–120B parameters) to provide empirical evidence that domain-specific experts exist, then introduces DSMoE, a training-free steering framework that routes inputs to these experts and reports outperformance over base MoE models and SFT baselines on both target and non-target domains, with zero added inference cost and no retraining.

Significance. If the identification procedure is shown to isolate genuine specialization rather than data-distribution artifacts and if performance gains are causally tied to the steering step, the work would supply both mechanistic insight into MoE expert behavior and a practical, zero-cost adaptation technique. The public code release is a positive factor for reproducibility.

major comments (3)

[§3 (Identification of domain-specific experts)] The abstract and §3 claim 'empirical evidence' for domain-specific experts, yet the manuscript provides no explicit definition, threshold, or statistical test for labeling an expert as domain-specific (e.g., activation-frequency ratio, p-value, or comparison to background routing). Without these details the central existence claim cannot be evaluated.
[§4.2 and Table 2] Table 2 and §4.2 report DSMoE gains, but the experiments contain no ablation that steers randomly selected high-activation experts or experts from a shuffled domain label. This omission leaves open the possibility that any high-frequency routing change produces similar uplift, undermining the claim that gains are specific to domain-specialized experts.
[§4.3] The generalization results across non-target domains (Table 3) are presented without controls for domain overlap in the pre-training data or for routing-pattern similarity between target and non-target sets. These factors could explain the reported robustness and must be quantified.

minor comments (2)

[§3] The notation for expert activation frequency is introduced without an equation; adding a compact definition (e.g., Eq. (1)) would improve clarity.
[Figure 1] Figure 1 caption should state the exact number of experts per model and the layer(s) examined.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate revisions to clarify our methodology, add necessary ablations, and include additional controls. These changes will strengthen the empirical claims without altering the core contributions.

read point-by-point responses

Referee: [§3 (Identification of domain-specific experts)] The abstract and §3 claim 'empirical evidence' for domain-specific experts, yet the manuscript provides no explicit definition, threshold, or statistical test for labeling an expert as domain-specific (e.g., activation-frequency ratio, p-value, or comparison to background routing). Without these details the central existence claim cannot be evaluated.

Authors: We acknowledge that the identification procedure in §3 was described procedurally but lacked an explicit formal definition. Domain-specific experts are identified by first computing the activation frequency of each expert on a held-out domain-specific dataset (e.g., code, math, or medical) versus a balanced general-domain corpus. An expert is labeled domain-specific if its activation ratio (domain frequency / general frequency) exceeds 2.0 and the difference is statistically significant under a paired t-test (p < 0.05) against background routing statistics collected from the same model on mixed data. We will add this precise definition, the ratio formula, the threshold rationale (chosen via sensitivity analysis), and the statistical test description to §3 and the appendix in the revised manuscript. revision: yes
Referee: [§4.2 and Table 2] Table 2 and §4.2 report DSMoE gains, but the experiments contain no ablation that steers randomly selected high-activation experts or experts from a shuffled domain label. This omission leaves open the possibility that any high-frequency routing change produces similar uplift, undermining the claim that gains are specific to domain-specialized experts.

Authors: We agree that the current ablations do not fully isolate the contribution of domain specialization. In the revision we will add two new control experiments on the same four models and datasets: (1) steering to randomly selected experts that exhibit high activation frequency on the target domain but are not the top-ranked domain-specific ones, and (2) steering using a shuffled domain-label mapping (i.e., experts identified for domain A are used for domain B). We will report the resulting performance deltas in an expanded Table 2 and discuss how the gains remain substantially larger when the true domain-specific experts are used, thereby supporting the specificity claim. revision: yes
Referee: [§4.3] The generalization results across non-target domains (Table 3) are presented without controls for domain overlap in the pre-training data or for routing-pattern similarity between target and non-target sets. These factors could explain the reported robustness and must be quantified.

Authors: This is a fair criticism. To quantify potential confounds we will add two analyses to §4.3: (1) domain-overlap measurement via TF-IDF cosine similarity and token-overlap statistics between the target-domain evaluation sets and publicly documented pre-training corpora (where available) or via perplexity on held-out pre-training shards; (2) routing-pattern similarity computed as the Pearson correlation between the expert-activation vectors produced by the base model on target versus non-target inputs. These metrics will be reported alongside Table 3, allowing readers to assess whether the observed generalization correlates with low overlap or dissimilar routing. If high overlap is detected for any pair, we will note it as a limitation. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical identification and steering validated externally

full rationale

The paper's chain consists of (1) running activation-frequency analysis on ten existing MoE models to label experts, (2) measuring downstream accuracy after steering those experts, and (3) comparing against SFT and unmodified baselines on held-out domain and non-domain tasks. None of these steps reduce to a self-definition, a fitted parameter renamed as a prediction, or a self-citation that is itself the sole justification. The performance numbers are obtained from standard benchmarks outside the identification procedure, so the result is not tautological by construction. Minor self-citation risk is absent from the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the paper introduces no new free parameters, axioms, or invented entities. DSMoE is described as training-free and zero-cost at inference, relying on existing MoE routing mechanisms.

pith-pipeline@v0.9.0 · 5524 in / 1053 out tokens · 27519 ms · 2026-05-10T20:14:21.460517+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning.Preprint, arXiv:2501.12948. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language under- standing. InProceedings of the 2019 Conference of the North American Chapter of the Association fo...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

Hyperrouter: Towards efficient training and inference of sparse mixture of experts.Preprint, arXiv:2312.07035. Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pel- lat, Kevi...

work page arXiv 2022
[3]

Robert A

Interpretable mixture of experts.Preprint, arXiv:2206.02107. Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts.Neural Computation, 3(1):79–87. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas,...

work page arXiv 1991
[4]

gpt-oss-120b & gpt-oss-20b Model Card

Multilinear mixture of experts: Scalable expert specialization through factorization. InAdvances in Neural Information Processing Systems, volume 37, pages 53022–53063. Curran Associates, Inc. OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Benne...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9637– 9662, Bangkok, Thailand

AnyGPT: Unified multimodal LLM with dis- crete sequence modeling. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9637– 9662, Bangkok, Thailand. Association for Computa- tional Linguistics. Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, zhifeng C...

work page 2022

[1] [1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning.Preprint, arXiv:2501.12948. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language under- standing. InProceedings of the 2019 Conference of the North American Chapter of the Association fo...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[2] [2]

Hyperrouter: Towards efficient training and inference of sparse mixture of experts.Preprint, arXiv:2312.07035. Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pel- lat, Kevi...

work page arXiv 2022

[3] [3]

Robert A

Interpretable mixture of experts.Preprint, arXiv:2206.02107. Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts.Neural Computation, 3(1):79–87. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas,...

work page arXiv 1991

[4] [4]

gpt-oss-120b & gpt-oss-20b Model Card

Multilinear mixture of experts: Scalable expert specialization through factorization. InAdvances in Neural Information Processing Systems, volume 37, pages 53022–53063. Curran Associates, Inc. OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Benne...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9637– 9662, Bangkok, Thailand

AnyGPT: Unified multimodal LLM with dis- crete sequence modeling. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9637– 9662, Bangkok, Thailand. Association for Computa- tional Linguistics. Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, zhifeng C...

work page 2022