
arxiv: 2604.17837 · v1 · submitted 2026-04-20 · 💻 cs.AI · cs.CL · cs.LG

Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs

Pith reviewed 2026-05-10 04:38 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords mixture of experts · interpretability · routing · polysemantic · monosemantic · trajectories · control signal · decomposition

The pith

Mixture-of-Experts models keep individual experts polysemantic but turn their routing paths into monosemantic trajectories by isolating an abstract control signal from surface content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a parameter-free decomposition that splits each layer's hidden state into a control signal driving the router and an orthogonal content channel the router cannot access. Surface features such as language, token identity, and position remain in the content channel while the control signal encodes an abstract function that rotates layer to layer. Low-bandwidth routing then forces compositional specialization across layers. Individual experts stay polysemantic, yet the full paths tokens follow cluster by semantic function across languages and surface forms. The decomposition shows that clusters in the control subspace are more monosemantic than those in the full representation, so the trajectory becomes the natural unit of interpretability.

Core claim

We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features in the content channel while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., ":") follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator.

What carries the argument

The parameter-free orthogonal decomposition of each layer's hidden state into a routing-driving control signal and an invisible content channel.
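A minimal sketch of such a split, assuming a purely linear router whose logits are `W_router @ h` (no bias and no normalization in front of the router). The function name, toy sizes, and random weights are illustrative assumptions, not taken from the paper. The sketch projects the hidden state onto the row space of the router matrix via SVD and treats the remainder as the router-blind part.

```python
import numpy as np

def split_hidden_state(W_router, h, tol=1e-10):
    """Split h into a router-visible part (row space of W_router)
    and a router-blind part (its orthogonal complement)."""
    # Thin SVD: rows of Vt are orthonormal directions in hidden space.
    _, s, Vt = np.linalg.svd(W_router, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))
    V = Vt[:rank].T            # (d_model, rank): directions the router can read
    h_vis = V @ (V.T @ h)      # projection onto the router's row space
    h_blind = h - h_vis        # remainder, orthogonal to every router row
    return h_vis, h_blind

rng = np.random.default_rng(0)
n_experts, d_model = 8, 64     # toy sizes, assumed for illustration only
W_router = rng.normal(size=(n_experts, d_model))
h = rng.normal(size=d_model)

h_vis, h_blind = split_hidden_state(W_router, h)
logits_full = W_router @ h     # what the router computes from the full state
logits_vis = W_router @ h_vis  # what it computes from the control part alone
```

Under these assumptions the two logit vectors agree up to floating-point error, which is the sense in which the blind component is invisible to the router. Real routers with a bias term or a pre-router normalization would need that step handled separately.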

If this is right

  • Tokens such as ":" follow distinct trajectories depending on whether they act as type annotations, introductory colons, or time separators.
  • Control-subspace clusters are substantially more monosemantic than clusters in the full representation.
  • Low-bandwidth routing produces compositional specialization across layers.
  • Surface-level features are preserved in the content channel while abstract functions rotate in the control signal.
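To make the first bullet concrete, here is a hedged sketch of how tokens could be grouped by identical expert path over a chosen set of layers. The records, expert IDs, and the helper `group_by_path` are hypothetical toy values, not data or code from the paper.

```python
from collections import defaultdict

def group_by_path(records, layers):
    """Group token contexts by their top-1 expert IDs at the given layers."""
    groups = defaultdict(list)
    for context, path in records:
        key = tuple(path[l] for l in layers)  # the path restricted to `layers`
        groups[key].append(context)
    return dict(groups)

# Hypothetical top-1 expert IDs per layer for several uses of ":".
records = [
    ("x: int",          [3, 7, 1, 4]),
    ("y: str",          [3, 7, 1, 2]),
    ("Note: see below", [5, 0, 6, 4]),
    ("12:30",           [2, 2, 9, 4]),
    ("09:15",           [2, 2, 9, 1]),
]

groups = group_by_path(records, layers=[0, 1, 2])
for path, contexts in groups.items():
    print(path, contexts)
```

In this toy data the type annotations, the introductory colon, and the time separators fall into three separate groups even though the routed token is always ":".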

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interpretability methods for MoEs should track full trajectories rather than inspect experts in isolation.
  • The split could be adapted to identify analogous control signals in dense models or other architectures.
  • Targeted edits to the control signal might allow steering of semantic behavior without changing the content channel.

Load-bearing premise

The orthogonal split isolates a causally driving control signal from content invisible to the router.

What would settle it

Running the decomposition on a new MoE model or token set and finding that paths fail to form clusters by semantic function or that the control signal does not causally determine routing decisions.
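One piece of such a test can be sketched on a toy linear router: perturb only the router-blind component and count how often the selected experts change. Everything here (the helper names, top-2 selection, sizes, and random weights) is an assumption for illustration. It checks the algebraic property at a single layer only and says nothing about downstream layers of a real model.

```python
import numpy as np

def top_k_experts(W_router, h, k=2):
    """Return the set of the k experts with the highest router logits."""
    logits = W_router @ h
    return set(np.argsort(logits)[-k:].tolist())

def blind_projector(W_router, tol=1e-10):
    """Projector onto the orthogonal complement of the router's row space."""
    _, s, Vt = np.linalg.svd(W_router, full_matrices=False)
    V = Vt[s > tol * s.max()].T
    return np.eye(W_router.shape[1]) - V @ V.T

rng = np.random.default_rng(1)
W_router = rng.normal(size=(8, 64))   # toy router, assumed linear
P_blind = blind_projector(W_router)

trials = 200
unchanged = 0
for _ in range(trials):
    h = rng.normal(size=64)
    # Large perturbation confined to the router-blind subspace.
    noise = 10.0 * (P_blind @ rng.normal(size=64))
    if top_k_experts(W_router, h) == top_k_experts(W_router, h + noise):
        unchanged += 1
```

If routing at the current layer changed under blind-only perturbations in a real model, that would point to a router that is not purely linear in the hidden state, or to an error in the split.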

Figures

Figures reproduced from arXiv: 2604.17837 by Bo Yuan, Charles Ye, Lee Sharkey.

Figure 1
One token, three programs. We plot 500 top-1 routing paths of the token ":" through 48 MoE layers of Qwen3-30B-A3B, separated by contextual function: Python type annotation (left), introductory colon (center), time separator (right). Despite identical token ID, the three uses produce distinct expert trajectories. Expert IDs are reordered at each layer using a shared Sugiyama-style layout to minimize edge cro…
Figure 2
Control and content in an MoE layer. We decompose the residual stream $h_l$ into orthogonal components via SVD of the routing matrix. The router-visible component ($h_l^{\mathrm{vis}}$, top) is the control signal: it alone determines expert selection. The router-blind component ($h_l^{\mathrm{blind}}$, bottom) is the content: invisible to routing, but processed by the selected expert alongside $h_l^{\mathrm{vis}}$.
Figure 3
Routing features are ephemeral. Probes predict the top-1 expert at the current layer $E(l)$ or next layer $E(l+1)$ from each channel. The control signal $h_l^{\mathrm{vis}}$ predicts $E(l)$ near-perfectly (dark orange, ∼99%) but carries almost no information about $E(l+1)$ (light orange, ∼35%). The content channel $h_l^{\mathrm{blind}}$ shows the reverse: weak for the current layer (dark blue, ∼67%), it is the strongest predictor of the…
Figure 4
Expert paths are monosemantic. Each group shows tokens (highlighted) sharing an identical mid-layer expert path in gpt-oss-20b. Context is grayed; highlighted tokens are the routed unit. Paths cluster by function across languages and surface forms: "overseas", "海外", and "境外" share no lexical overlap but follow the same computation trace. The control subspace is the source. Our decomposition predicts this: if p…
Figure 5
Per-dimension hidden-state magnitude $M^{h}_{l,d}$ (x-axis) versus router-weight magnitude $M^{R}_{l,d}$ (y-axis), both on log scales. One dot = one hidden dimension $d$ in layer $l$. Colors represent layers. Strong positive correlations are observed, indicating that router weights are largest on dimensions where the representation already has the highest magnitude. We show that routing is a surprisingly low-dimensional process. …
Figure 6
Probe accuracy versus fraction of highest-magnitude hidden dimensions used. Results…
Figure 7
Average cross-layer cosine similarities $C_l^{\mathrm{vis}}$ and $C_l^{\mathrm{blind}}$ (shaded: 95% bootstrap CIs).
Figure 8
Language probe. Normalized MI% between channel representations and language label. $h_l^{\mathrm{blind}}$ (blue) encodes language consistently; $h_l^{\mathrm{vis}}$ (orange) carries substantially less, especially at middle layers.
Figure 9
Token ID probe results. Normalized MI% between channel representations and token ID (filtered for top 100 token IDs).
[Plot omitted. Panels: OlMoE, Granite-4.0-Tiny, Deepseek-v2-Lite, Qwen3-30B-A3B, gpt-oss-20b, GLM-4.5-Air. x-axis: layer index (l). y-axis: MI %. Series: hvis(l) → sequence_position, hblind(l) → sequence_position.]
Figure 10
Sequence position probe results. Normalized MI% between channel representations and token position ID.
Original abstract

An LLM's residual stream is both state and instruction: it encodes the current context and determines the next transformation. We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features (language, token identity, position) in the content channel, while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., ":") follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator. Our decomposition identifies the source of this structure: clusters in the control subspace are substantially more monosemantic than those in the full representation. As a result, the natural unit of interpretability in MoEs is not the expert but the trajectory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a parameter-free orthogonal decomposition of the residual stream in Mixture-of-Experts (MoE) models that isolates a control signal (which drives routing) from an orthogonal content channel (invisible to the router). Across six MoE architectures, it reports that surface features (language, token identity, position) are preserved in the content channel while the control signal encodes abstract functions that rotate layer to layer, inducing compositional specialization. Individual experts remain polysemantic, but expert paths (trajectories) become monosemantic, clustering tokens by semantic function (e.g., distinct trajectories for the token ':' in different syntactic roles). The manuscript concludes that the natural unit of interpretability in MoEs is the trajectory rather than the expert.

Significance. If the decomposition and monosemantic-path observations hold under broader testing, the work offers a concrete, parameter-free tool for mechanistic interpretability of MoEs by reframing routing as a low-bandwidth control mechanism. The cross-architecture consistency and the explicit separation of control versus content are notable strengths that could guide future analyses of routing dynamics and compositional behavior in large models.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Decomposition): the claim that control-subspace clusters are 'substantially more monosemantic' than full or content representations requires a quantitative monosemanticity metric, statistical tests, and controls for token selection; none are supplied in the abstract and the procedure details are insufficient to verify whether the orthogonal split truly isolates a causally driving signal from downstream content effects.
  2. [§4 and §5] §4 (Empirical Results) and §5 (Discussion): the conclusion that trajectories are the natural interpretability unit rests on observations from only six architectures and illustrative tokens (e.g., ':'); without a defined generalization test or broader quantitative evaluation across additional models and a larger token set, the monosemantic-path clustering may be sensitive to architecture-specific router properties or post-hoc selection.
minor comments (2)
  1. [§2] Notation for the control/content split (e.g., the exact orthogonality condition and projection) should be stated with an explicit equation early in §2 to aid reproducibility.
  2. [Figures] Figure captions for trajectory visualizations should include the exact token contexts and layer indices shown, rather than relying on the main text.
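One way the explicit condition requested in minor comment 1 could be written, assuming a linear router with weight matrix $W_l$ at layer $l$ and a thin SVD restricted to its nonzero singular values. The symbols $W_l$, $U_l$, $\Sigma_l$, and $V_l$ are assumed notation and may differ from the paper's; $h_l$, $h_l^{\mathrm{vis}}$, and $h_l^{\mathrm{blind}}$ follow the Figure 2 caption.

```latex
\begin{aligned}
W_l &= U_l \Sigma_l V_l^{\top}, \\
h_l^{\mathrm{vis}} &= V_l V_l^{\top} h_l, \qquad
h_l^{\mathrm{blind}} = \left(I - V_l V_l^{\top}\right) h_l, \\
h_l &= h_l^{\mathrm{vis}} + h_l^{\mathrm{blind}}, \qquad
\left\langle h_l^{\mathrm{vis}},\, h_l^{\mathrm{blind}} \right\rangle = 0, \qquad
W_l\, h_l^{\mathrm{blind}} = 0 .
\end{aligned}
```

The last identity is what makes the blind component invisible to a linear router: $W_l h_l = W_l h_l^{\mathrm{vis}}$.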

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, proposing specific revisions to strengthen the quantitative aspects of our claims while maintaining the core contributions.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Decomposition): the claim that control-subspace clusters are 'substantially more monosemantic' than full or content representations requires a quantitative monosemanticity metric, statistical tests, and controls for token selection; none are supplied in the abstract and the procedure details are insufficient to verify whether the orthogonal split truly isolates a causally driving signal from downstream content effects.

    Authors: We agree that the presentation in the abstract and §3 relies on qualitative evidence from clustering visualizations and specific token examples to support the monosemanticity of control-subspace clusters. The decomposition is parameter-free and orthogonal by construction, and we show that routing decisions depend only on the control component through ablation experiments. However, to provide a more rigorous quantification, we will revise §3 to include a monosemanticity metric defined as the average purity of semantic function clusters (using annotated token roles), along with statistical tests comparing control, content, and full representations. We will also detail the token selection procedure and controls to confirm the causal isolation of the routing signal. revision: yes

  2. Referee: [§4 and §5] §4 (Empirical Results) and §5 (Discussion): the conclusion that trajectories are the natural interpretability unit rests on observations from only six architectures and illustrative tokens (e.g., ':'); without a defined generalization test or broader quantitative evaluation across additional models and a larger token set, the monosemantic-path clustering may be sensitive to architecture-specific router properties or post-hoc selection.

    Authors: The results are based on six MoE architectures chosen for their diversity in scale and design, and the token examples are selected to illustrate the key phenomenon of function-specific trajectories. We recognize that this is illustrative rather than exhaustive. In revision, we will add a quantitative evaluation of path monosemanticity across a broader sample of tokens (e.g., 100 common tokens) using a clustering coherence score, and include a discussion of potential architecture sensitivities. While we cannot exhaustively test all possible MoEs, the consistency across the tested models supports the claim, and we will frame the conclusion more cautiously as a hypothesis for future work. revision: partial
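The purity metric proposed in the first response could look roughly like the sketch below. It uses the standard size-weighted cluster purity, which is one reading of "average purity". The function name, role labels, and cluster assignments are hypothetical toy values, not results from the paper.

```python
from collections import Counter, defaultdict

def average_purity(cluster_ids, role_labels):
    """Size-weighted purity: the fraction of items whose annotated role
    matches the majority role of the cluster they were assigned to."""
    by_cluster = defaultdict(list)
    for cluster, role in zip(cluster_ids, role_labels):
        by_cluster[cluster].append(role)
    majority_total = sum(
        Counter(roles).most_common(1)[0][1] for roles in by_cluster.values()
    )
    return majority_total / len(role_labels)

# Hypothetical annotated roles for six occurrences of ":".
roles = ["type", "type", "intro", "time", "time", "intro"]

# Hypothetical cluster assignments from two representations.
control_clusters = [0, 0, 1, 2, 2, 1]   # each cluster holds a single role
full_clusters    = [0, 0, 0, 1, 1, 1]   # each cluster mixes two roles

purity_control = average_purity(control_clusters, roles)
purity_full = average_purity(full_clusters, roles)
```

Purity rises trivially as the number of clusters grows, so a comparison between control, content, and full representations would need the cluster count held fixed or a chance-corrected score alongside it.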

Circularity Check

0 steps flagged

No significant circularity: empirical decomposition is self-contained

full rationale

The paper introduces a parameter-free orthogonal decomposition of hidden states into control and content channels, then reports empirical observations of monosemantic path clustering across six architectures. No equations, fitted parameters, or derivations are presented that reduce to their own inputs by construction. The central claim that trajectories (not experts) are the natural interpretability unit follows directly from the observed clustering results rather than from any self-definition, self-citation chain, or renamed prior result. The analysis contains no load-bearing self-citations or ansatzes smuggled via prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; it describes a parameter-free orthogonal decomposition but provides no explicit free parameters, axioms, or invented entities beyond standard MoE concepts.

pith-pipeline@v0.9.0 · 5498 in / 1228 out tokens · 55029 ms · 2026-05-10T04:38:18.630503+00:00 · methodology

