Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs
Pith reviewed 2026-05-10 04:38 UTC · model grok-4.3
The pith
Mixture-of-Experts models keep individual experts polysemantic but turn their routing paths into monosemantic trajectories by isolating an abstract control signal from surface content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features in the content channel while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms.
What carries the argument
The parameter-free orthogonal decomposition of each layer's hidden state into a routing-driving control signal and an invisible content channel.
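The split described above can be sketched in a few lines. This is a minimal illustration, not the paper's stated procedure: it assumes the router is a plain linear map (logits = W_router @ h) and takes the control signal to be the projection of the hidden state onto the row space of the router weights, so no new parameters are learned. The names `split_control_content`, `W_router`, and the sizes are hypothetical.

```python
import numpy as np

def split_control_content(h, W_router):
    """Split a hidden state h (d,) into control and content parts.

    Assumption: the router is linear, logits = W_router @ h, so the
    control part is the projection of h onto the row space of W_router.
    """
    # pinv(W) @ W is the orthogonal projector onto the row space of W.
    P = np.linalg.pinv(W_router) @ W_router
    h_control = P @ h
    h_content = h - h_control  # lies in the null space of W_router
    return h_control, h_content

rng = np.random.default_rng(0)
d_model, n_experts = 16, 4  # hypothetical sizes
W_router = rng.normal(size=(n_experts, d_model))
h = rng.normal(size=d_model)

h_control, h_content = split_control_content(h, W_router)
```

Under these assumptions the content part produces zero router logits, the control part reproduces the full logits, and the two parts are orthogonal. A real router with a bias term, normalization, or a nonlinearity would need a different construction.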
If this is right
- Tokens such as ":" follow distinct trajectories depending on whether they act as type annotations, introductory colons, or time separators.
- Control-subspace clusters are substantially more monosemantic than clusters in the full representation.
- Low-bandwidth routing produces compositional specialization across layers.
- Surface-level features are preserved in the content channel while abstract functions rotate in the control signal.
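The idea of a path as the unit of analysis can be made concrete with a small sketch. The routing records below are invented for illustration and are not results from the paper; the format (one top-1 expert index per layer) and the helper name `group_by_path` are assumptions.

```python
from collections import defaultdict

def group_by_path(routes):
    """Group token occurrences by their routing path.

    routes maps an occurrence label to the sequence of expert indices
    chosen at each layer (hypothetical data format, top-1 routing).
    """
    paths = defaultdict(list)
    for occurrence, path in routes.items():
        paths[tuple(path)].append(occurrence)
    return dict(paths)

# Made-up routing records for the token ":" in different roles.
routes = {
    "x: int (type annotation)": (3, 1, 7),
    "y: str (type annotation)": (3, 1, 7),
    "Note: see below (introductory colon)": (3, 5, 2),
    "12:30 (time separator)": (0, 5, 4),
}
paths = group_by_path(routes)
```

In this toy data, expert 3 at the first layer and expert 5 at the second layer each serve two different roles, yet every complete path contains a single role. That is the pattern the bullets describe: polysemantic experts, monosemantic paths.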
Where Pith is reading between the lines
- Interpretability methods for MoEs should track full trajectories rather than inspect experts in isolation.
- The split could be adapted to identify analogous control signals in dense models or other architectures.
- Targeted edits to the control signal might allow steering of semantic behavior without changing the content channel.
Load-bearing premise
The orthogonal split isolates a causally driving control signal from content invisible to the router.
What would settle it
Running the decomposition on a new MoE model or token set and finding that paths fail to form clusters by semantic function or that the control signal does not causally determine routing decisions.
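One way to probe the causal half of that test is a swap check: combine the control part of one hidden state with the content part of another and see whether routing follows the control donor. The sketch below is an assumption-laden toy with a random linear router and top-2 selection. On a real model the check would have to run through the actual forward pass, and all names here are hypothetical.

```python
import numpy as np

def projector(W_router):
    """Orthogonal projector onto the row space of a linear router."""
    return np.linalg.pinv(W_router) @ W_router

def top_k_experts(h, W_router, k=2):
    """Indices of the k experts with the largest router logits."""
    logits = W_router @ h
    return set(np.argsort(logits)[-k:].tolist())

rng = np.random.default_rng(1)
d_model, n_experts = 32, 8  # hypothetical sizes
W_router = rng.normal(size=(n_experts, d_model))
P = projector(W_router)

h_a = rng.normal(size=d_model)
h_b = rng.normal(size=d_model)

# Hybrid state: control part of A plus content part of B.
h_hybrid = P @ h_a + (h_b - P @ h_b)

same_as_a = top_k_experts(h_hybrid, W_router) == top_k_experts(h_a, W_router)
```

If the hybrid state routed like B rather than A on a real model, that would count against the claim that the control signal alone determines routing.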
Original abstract
An LLM's residual stream is both state and instruction: it encodes the current context and determines the next transformation. We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features (language, token identity, position) in the content channel, while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., ":") follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator. Our decomposition identifies the source of this structure: clusters in the control subspace are substantially more monosemantic than those in the full representation. As a result, the natural unit of interpretability in MoEs is not the expert but the trajectory.
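The "low-bandwidth" remark can be illustrated with a rough count. The sketch below bounds the information in one top-k routing decision by the number of possible expert subsets and ignores the continuous gate weights. The configuration (8 experts, top-2, 32 layers) is a hypothetical example, not one of the six architectures studied.

```python
import math

def routing_bits(n_experts, top_k):
    """Upper bound on bits carried by one top-k expert selection."""
    return math.log2(math.comb(n_experts, top_k))

n_experts, top_k, n_layers = 8, 2, 32  # hypothetical configuration
bits_per_layer = routing_bits(n_experts, top_k)
bits_per_path = n_layers * bits_per_layer
```

With these numbers a single decision picks one of 28 expert pairs, or about 4.8 bits, while a full path across 32 layers can carry roughly 154 bits. A single layer cannot separate many semantic functions on its own, but a sequence of decisions can, which is consistent with the compositional reading in the abstract.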
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a parameter-free orthogonal decomposition of the residual stream in Mixture-of-Experts (MoE) models that isolates a control signal (which drives routing) from an orthogonal content channel (invisible to the router). Across six MoE architectures, it reports that surface features (language, token identity, position) are preserved in the content channel while the control signal encodes abstract functions that rotate layer to layer, inducing compositional specialization. Individual experts remain polysemantic, but expert paths (trajectories) become monosemantic, clustering tokens by semantic function (e.g., distinct trajectories for the token ':' in different syntactic roles). The manuscript concludes that the natural unit of interpretability in MoEs is the trajectory rather than the expert.
Significance. If the decomposition and monosemantic-path observations hold under broader testing, the work offers a concrete, parameter-free tool for mechanistic interpretability of MoEs by reframing routing as a low-bandwidth control mechanism. The cross-architecture consistency and the explicit separation of control versus content are notable strengths that could guide future analyses of routing dynamics and compositional behavior in large models.
major comments (2)
- [Abstract and §3] Abstract and §3 (Decomposition): the claim that control-subspace clusters are 'substantially more monosemantic' than full or content representations requires a quantitative monosemanticity metric, statistical tests, and controls for token selection; none are supplied in the abstract and the procedure details are insufficient to verify whether the orthogonal split truly isolates a causally driving signal from downstream content effects.
- [§4 and §5] §4 (Empirical Results) and §5 (Discussion): the conclusion that trajectories are the natural interpretability unit rests on observations from only six architectures and illustrative tokens (e.g., ':'); without a defined generalization test or broader quantitative evaluation across additional models and a larger token set, the monosemantic-path clustering may be sensitive to architecture-specific router properties or post-hoc selection.
minor comments (2)
- [§2] Notation for the control/content split (e.g., the exact orthogonality condition and projection) should be stated with an explicit equation early in §2 to aid reproducibility.
- [Figures] Figure captions for trajectory visualizations should include the exact token contexts and layer indices shown, rather than relying on the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, proposing specific revisions to strengthen the quantitative aspects of our claims while maintaining the core contributions.
- Referee: [Abstract and §3] Abstract and §3 (Decomposition): the claim that control-subspace clusters are 'substantially more monosemantic' than full or content representations requires a quantitative monosemanticity metric, statistical tests, and controls for token selection; none are supplied in the abstract and the procedure details are insufficient to verify whether the orthogonal split truly isolates a causally driving signal from downstream content effects.
  Authors: We agree that the presentation in the abstract and §3 relies on qualitative evidence from clustering visualizations and specific token examples to support the monosemanticity of control-subspace clusters. The decomposition is parameter-free and orthogonal by construction, and we show that routing decisions depend only on the control component through ablation experiments. However, to provide a more rigorous quantification, we will revise §3 to include a monosemanticity metric defined as the average purity of semantic function clusters (using annotated token roles), along with statistical tests comparing control, content, and full representations. We will also detail the token selection procedure and controls to confirm the causal isolation of the routing signal.
  Revision: yes
- Referee: [§4 and §5] §4 (Empirical Results) and §5 (Discussion): the conclusion that trajectories are the natural interpretability unit rests on observations from only six architectures and illustrative tokens (e.g., ':'); without a defined generalization test or broader quantitative evaluation across additional models and a larger token set, the monosemantic-path clustering may be sensitive to architecture-specific router properties or post-hoc selection.
  Authors: The results are based on six MoE architectures chosen for their diversity in scale and design, and the token examples are selected to illustrate the key phenomenon of function-specific trajectories. We recognize that this is illustrative rather than exhaustive. In revision, we will add a quantitative evaluation of path monosemanticity across a broader sample of tokens (e.g., 100 common tokens) using a clustering coherence score, and include a discussion of potential architecture sensitivities. While we cannot exhaustively test all possible MoEs, the consistency across the tested models supports the claim, and we will frame the conclusion more cautiously as a hypothesis for future work.
  Revision: partial
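The average-purity metric proposed in the first response could look like the sketch below. The rebuttal does not give its exact definition, so this is one plausible reading: the purity of a cluster is the share of its members that carry the cluster's most common annotated role, and the score is the unweighted mean over clusters. The cluster assignments and role labels are invented for illustration.

```python
from collections import Counter, defaultdict

def average_cluster_purity(cluster_ids, roles):
    """Mean over clusters of the share held by the majority role."""
    members = defaultdict(list)
    for cluster, role in zip(cluster_ids, roles):
        members[cluster].append(role)
    purities = [
        Counter(rs).most_common(1)[0][1] / len(rs)
        for rs in members.values()
    ]
    return sum(purities) / len(purities)

# Made-up annotations and cluster assignments, not data from the paper.
roles = ["annotation", "annotation", "intro", "intro", "time", "time"]
control_clusters = [0, 0, 1, 1, 2, 2]  # hypothetical control-subspace clustering
full_clusters = [0, 0, 0, 1, 1, 1]     # hypothetical full-representation clustering

purity_control = average_cluster_purity(control_clusters, roles)
purity_full = average_cluster_purity(full_clusters, roles)
```

Purity rises as the number of clusters grows, so a fair comparison between control, content, and full representations would fix the cluster count or report a chance baseline alongside the score.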
Circularity Check
No significant circularity: empirical decomposition is self-contained
The paper introduces a parameter-free orthogonal decomposition of hidden states into control and content channels, then reports empirical observations of monosemantic path clustering across six architectures. No equations, fitted parameters, or derivations are presented that reduce to their own inputs by construction. The central claim that trajectories (not experts) are the natural interpretability unit follows directly from the observed clustering results rather than from any self-definition, self-citation chain, or renamed prior result. The analysis contains no load-bearing self-citations or ansatzes smuggled via prior work by the same authors.