
arxiv: 2605.16349 · v1 · pith:RTEBUJKAnew · submitted 2026-05-08 · 💻 cs.LG

Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap

Pith reviewed 2026-05-20 23:42 UTC · model grok-4.3

classification 💻 cs.LG
keywords Mixture-of-Experts · expert specialization · Jacobian alignment · representation subspaces · routing sparsity · Transformer models · geometric analysis · conditional computation

The pith

MoE experts in pretrained Transformers show near-zero functional correlation but only partial overlap in their representation subspaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a Jacobian-PCA-Grassmann framework to examine specialization inside Mixture-of-Experts layers of large Transformers. It reports that experts perform nearly uncorrelated computations, shown by consistently low cross-expert Jacobian alignment, yet the representations they receive and produce sit in subspaces that are distinct but still overlap. Experiments on Mistral and Qwen models indicate this asymmetry is stable, and that top-k routing sharpens the separation while softer routing increases entanglement. A reader would care because the result supplies a geometric account of why sparse routing can expand model capacity without forcing every expert to duplicate the same work. The framework also supplies a practical diagnostic for probing conditional computation in current architectures.

Core claim

Across pretrained MoE Transformers, experts exhibit strong functional decorrelation with near-zero cross-expert Jacobian alignment while their routed representations occupy distinct but partially overlapping subspaces. Functional decorrelation and representational overlap therefore coexist rather than coincide. Controlled routing experiments show that top-k routing produces sharper functional separation and larger subspace divergence, whereas fully soft routing yields more entangled expert structure. The results support viewing MoE layers as locally decorrelated operators acting over overlapping submanifolds on a shared representation manifold.

What carries the argument

The Jacobian-PCA-Grassmann framework, which quantifies functional decorrelation through cross-expert Jacobian alignment and representational overlap through subspace distances on the Grassmann manifold.
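
As a rough illustration of the function-space half of this framework, the sketch below builds a cross-expert Jacobian alignment matrix for toy experts. Everything in it is an assumption made for illustration: the two-layer tanh experts, the helper names `expert_jacobian` and `jacobian_alignment`, and the use of cosine similarity under the Frobenius inner product. The Figure 1 caption mentions cosine similarity between expert-local Jacobians, but the paper's exact normalisation is not shown on this page.

```python
import numpy as np

def expert_jacobian(w1, w2, x):
    """Jacobian of the toy expert f(x) = w2 @ tanh(w1 @ x) with respect to x."""
    h = np.tanh(w1 @ x)
    # Chain rule: d tanh(u)/du = 1 - tanh(u)^2, applied elementwise.
    return w2 @ (np.diag(1.0 - h**2) @ w1)

def jacobian_alignment(j_a, j_b):
    """Cosine similarity of two Jacobians under the Frobenius inner product."""
    num = np.sum(j_a * j_b)
    den = np.linalg.norm(j_a) * np.linalg.norm(j_b)
    return float(num / den)

# Hypothetical setup: 4 randomly initialised experts evaluated at one input point.
rng = np.random.default_rng(0)
d, hidden = 16, 32
experts = [
    (rng.standard_normal((hidden, d)), rng.standard_normal((d, hidden)))
    for _ in range(4)
]
x = rng.standard_normal(d)

jacs = [expert_jacobian(w1, w2, x) for w1, w2 in experts]
sim = np.array([[jacobian_alignment(a, b) for b in jacs] for a in jacs])
print(np.round(sim, 3))
```

The diagonal of `sim` is 1 by construction; the off-diagonal entries are the cross-expert alignments that the paper reports as near zero in pretrained models. A real analysis would evaluate the Jacobians at many routed inputs rather than at a single point.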

If this is right

  • Top-k routing sharpens functional separation and increases subspace divergence between experts.
  • Fully soft routing produces more entangled expert structure in both function and representation space.
  • MoE layers implement locally decorrelated operators over overlapping submanifolds on a shared representation manifold.
  • Routing sparsity is a primary driver of the observed geometric asymmetry in expert specialization.
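
The contrast between the two routing regimes in the bullets above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's experimental code: the function names `soft_routing` and `top_k_routing` and the example logits are hypothetical, and top-k weights are renormalised with a softmax over the selected logits, which is one common convention.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_routing(logits):
    """Fully soft routing: every expert receives a nonzero weight."""
    return softmax(logits)

def top_k_routing(logits, k):
    """Top-k routing: keep the k largest logits, renormalise, zero the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept = softmax([logits[i] for i in top])
    weights = [0.0] * len(logits)
    for i, w in zip(top, kept):
        weights[i] = w
    return weights

# Hypothetical router logits for one token and four experts.
logits = [2.0, 0.5, 1.0, -1.0]
soft = soft_routing(logits)
sparse = top_k_routing(logits, k=2)
print("soft :", [round(w, 3) for w in soft])
print("top-2:", [round(w, 3) for w in sparse])
```

Under soft routing every expert contributes to every token, so all experts see gradients from the same inputs; under top-k only the selected experts do, which is the mechanism the paper associates with sharper separation.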

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could deliberately adjust routing temperature or k to tune the desired balance between functional independence and representational sharing.
  • The partial overlap finding suggests that expert merging or pruning algorithms might safely combine experts whose subspaces are highly aligned without large performance loss.
  • The same measurement pipeline could be applied to study specialization in other conditional-computation architectures beyond standard MoE Transformers.
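
To make the first of these extensions concrete, the sketch below shows how a routing temperature moves a softmax router between near-hard and near-uniform assignment, using the entropy of the routing weights as a simple sharpness measure. The temperature parameterisation, the entropy measure, and the example logits are all assumptions chosen for illustration; the paper's controlled experiments are described here only as top-k versus fully soft routing.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax; lower temperature gives sharper weights."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-weight entries contribute nothing."""
    return -sum(w * math.log(w) for w in p if w > 0.0)

# Hypothetical router logits for one token and four experts.
logits = [2.0, 0.5, 1.0, -1.0]
temperatures = (0.1, 0.5, 1.0, 5.0)
entropies = {t: entropy(softmax(logits, t)) for t in temperatures}

for t in temperatures:
    print(f"T={t}: entropy={entropies[t]:.3f}")
```

Entropy rises with temperature and is bounded above by log(4) for four experts, so the temperature gives a continuous knob between the two regimes the paper compares.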

Load-bearing premise

The Jacobian-PCA-Grassmann measurements give a faithful and complete picture of expert specialization without needing confirmation from other metrics or causal interventions.

What would settle it

Finding high cross-expert Jacobian alignment or completely non-overlapping subspaces in additional pretrained MoE models would contradict the reported asymmetry.

Figures

Figures reproduced from arXiv: 2605.16349 by Feilong Liu.

Figure 1: Geometric structure of MoE specialization in Mistral-8×7B (Layer 16). (a) Cross-expert Jacobian similarity matrix. (b) Distribution of Jacobian similarities. (c) Routed PCA spectra for dense vs. MoE expert layer. (d) Grassmannian distances between expert subspaces (top-5 components; theoretical maximum ≈ 3.51). Cross-expert alignment. Figures 1a-b show the cosine similarity between expert-local Jacobians i…
Figure 2: Geometric structure of MoE specialization in Qwen1.5-MoE-A2.7B (Layer 16). (a) Cross-expert Jacobian similarity matrix (first 10 experts shown). (b) Distribution of Jacobian similarities. (c) Routed PCA spectra for dense vs. MoE expert layer. (d) Grassmannian distances between expert subspaces (top-5 components; theoretical maximum ≈ 3.51).
Figure 3: Effect of routing sharpness on MoE geometry in the controlled 3-layer Transformer model. (a–b) Jacobian similarity matrix under fully-soft vs. Top-k routing. (c–d) Distribution of Jacobian similarities under fully-soft vs. Top-k routing.
Figure 4: Effect of routing sharpness on MoE geometry in the controlled 3-layer Transformer model. (a–b) Grassmannian distance matrix under fully-soft vs. Top-k routing. (c–d) Distribution of Grassmannian distances under fully-soft vs. Top-k routing (theoretical maximum ≈ 3.51).
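
The captions quote a theoretical maximum of about 3.51 for Grassmannian distances between top-5 subspaces. That number is consistent with the geodesic distance, the Euclidean norm of the principal angles, whose maximum for k = 5 is √5 · π/2 ≈ 3.512 when all five angles equal π/2. The sketch below assumes that definition; the exact formula used in the paper is not given on this page, and the helper names and random activations are hypothetical.

```python
import numpy as np

def pca_basis(acts, k):
    """Top-k principal directions (as columns) of an (n_tokens, d) matrix."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def principal_angles(a, b):
    """Principal angles between the column spans of a and b (both d x k)."""
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    s = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def grassmann_distance(a, b):
    """Geodesic distance: Euclidean norm of the principal angles (assumed)."""
    return float(np.linalg.norm(principal_angles(a, b)))

k = 5
max_dist = np.sqrt(k) * np.pi / 2  # about 3.512 for k = 5

# Sanity checks on coordinate subspaces of R^10.
eye = np.eye(10)
d_same = grassmann_distance(eye[:, :5], eye[:, :5])   # identical subspaces
d_orth = grassmann_distance(eye[:, :5], eye[:, 5:])   # orthogonal subspaces

# Hypothetical routed activations for two experts (random stand-ins).
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((200, 32))
acts_b = rng.standard_normal((200, 32))
d_rand = grassmann_distance(pca_basis(acts_a, k), pca_basis(acts_b, k))
print(max_dist, d_same, d_orth, d_rand)
```

Distances strictly between 0 and the maximum correspond to the "distinct but partially overlapping" subspaces described in the core claim.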
Original abstract

Mixture-of-Experts (MoE) architectures achieve scalable capacity through sparse routing, yet the geometric structure of expert specialization remains poorly understood. We introduce a unified Jacobian-PCA-Grassmann framework for analyzing MoE layers in both function space and representation space. Across pretrained MoE Transformers (Mistral, Qwen), we find a consistent structural asymmetry: experts exhibit strong functional decorrelation (consistently low, near-zero cross-expert Jacobian alignment) while their routed representations occupy distinct but partially overlapping subspaces. This indicates that functional decorrelation and representation overlap coexist rather than coincide in MoE specialization. Controlled routing experiments further indicate that routing sparsity appears to be a key factor shaping this geometry: top-k routing induces sharper functional separation and larger subspace divergence, whereas fully soft routing yields more entangled expert structure. Together, these results suggest a geometric interpretation in which MoE layers may be viewed as implementing locally decorrelated operators over overlapping submanifolds on a shared representation manifold, and provide a general diagnostic framework for studying conditional computation in modern Transformer architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a Jacobian-PCA-Grassmann framework for analyzing MoE layers in both function space (via cross-expert Jacobian alignment) and representation space (via PCA subspaces and Grassmann distances). Across pretrained models (Mistral, Qwen), it reports a consistent asymmetry: strong functional decorrelation (near-zero cross-expert Jacobian alignment) coexisting with partial representational overlap in routed subspaces. Controlled experiments compare top-k versus soft routing to argue that sparsity drives sharper separation, leading to the interpretation of MoE layers as locally decorrelated operators over overlapping submanifolds.

Significance. If the framework and measurements prove robust, the work supplies a concrete geometric diagnostic for conditional computation in Transformers and highlights a non-obvious dissociation between functional and representational specialization. The use of real pretrained checkpoints rather than toy models is a positive feature; the controlled routing ablations, if cleanly isolated, could inform architecture choices. The absence of parameter fitting or self-referential definitions in the reported measurements is also a strength.

major comments (3)
  1. [Framework definition and §4 (experimental setup)] The central claim that functional decorrelation coexists with representational overlap rests on the Jacobian-PCA-Grassmann pipeline faithfully capturing both spaces. The manuscript does not report validation of Jacobian alignment against global function metrics (e.g., output correlation on held-out inputs) or alternative specialization measures, leaving open the possibility that the reported near-zero alignment reflects only local linear behavior at sampled points rather than the full expert mapping.
  2. [Controlled routing experiments] In the controlled routing experiments, routed representations are extracted conditionally on the same routing decisions used to define the subspaces. This introduces a potential circularity that the top-k versus soft comparison does not automatically resolve; an independent intervention (e.g., fixed random routing masks or post-hoc subspace projection) would be needed to establish sparsity as the causal driver.
  3. [Results and figures] The abstract and results claim 'consistent' low Jacobian alignment and 'partial overlap' across Mistral and Qwen, yet no error bars, layer-wise statistics, or sample-size details are referenced in the provided description. Without these, the strength of the cross-model generalization cannot be assessed.
minor comments (2)
  1. [Notation and methods] Clarify the precise sampling strategy for Jacobian estimation (number of points, input distribution) and the exact Grassmann distance formula employed.
  2. [Figures] Add random or shuffled-expert baselines to the Jacobian-alignment and subspace-overlap plots so that 'near-zero' and 'partial overlap' can be interpreted relative to chance.
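
The request for chance-level baselines in the last minor comment matters because cosine similarity between unrelated high-dimensional matrices is already close to zero. The sketch below estimates that null distribution for random Gaussian matrices standing in for Jacobians. The Gaussian null model, the dimension `d = 64`, and the sample count are assumptions for illustration; a baseline closer to the referee's suggestion would shuffle expert assignments on the real model.

```python
import numpy as np

def frobenius_cosine(a, b):
    """Cosine similarity of two matrices under the Frobenius inner product."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, n_pairs = 64, 200

# Null model (assumed): independent d x d Gaussian matrices as fake Jacobians.
null_sims = np.array([
    frobenius_cosine(rng.standard_normal((d, d)), rng.standard_normal((d, d)))
    for _ in range(n_pairs)
])

null_mean = float(null_sims.mean())
null_std = float(null_sims.std())
print(f"null mean={null_mean:.4f}, null std={null_std:.4f}, 1/d={1 / d:.4f}")
```

For independent Gaussian matrices the flattened vectors have d² entries, so the null spread is roughly 1/d. Observed alignments are only informative about specialization if they are compared against a spread of this kind.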

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below, clarifying our methodological choices where appropriate and outlining planned revisions to improve clarity and robustness.

Point-by-point responses
  1. Referee: The central claim that functional decorrelation coexists with representational overlap rests on the Jacobian-PCA-Grassmann pipeline faithfully capturing both spaces. The manuscript does not report validation of Jacobian alignment against global function metrics (e.g., output correlation on held-out inputs) or alternative specialization measures, leaving open the possibility that the reported near-zero alignment reflects only local linear behavior at sampled points rather than the full expert mapping.

    Authors: We agree that explicit validation against global metrics would strengthen the interpretation of the Jacobian results. While the Jacobian alignment is chosen to probe local linear behavior around activation points (relevant for sparse expert routing), we will add a new subsection in the revised manuscript comparing cross-expert Jacobian alignment to direct output correlations on held-out inputs, as well as to an alternative measure based on expert output divergence. This will help confirm that the observed near-zero alignment generalizes beyond the local linear regime. revision: yes

  2. Referee: In the controlled routing experiments, routed representations are extracted conditionally on the same routing decisions used to define the subspaces. This introduces a potential circularity that the top-k versus soft comparison does not automatically resolve; an independent intervention (e.g., fixed random routing masks or post-hoc subspace projection) would be needed to establish sparsity as the causal driver.

    Authors: We appreciate the concern about potential circularity. The top-k versus soft comparison holds the model weights fixed while varying only the routing mechanism, allowing us to attribute geometric differences to sparsity level. Nevertheless, to more rigorously isolate causality, we will add an ablation using fixed random routing masks (independent of the learned router) and report the resulting subspace and Jacobian metrics. This will be included as an additional controlled experiment in the revised version. revision: yes

  3. Referee: The abstract and results claim 'consistent' low Jacobian alignment and 'partial overlap' across Mistral and Qwen, yet no error bars, layer-wise statistics, or sample-size details are referenced in the provided description. Without these, the strength of the cross-model generalization cannot be assessed.

    Authors: We agree that quantitative details on variability are necessary to support claims of consistency. In the revised manuscript we will augment the results section and figures with error bars (standard error across layers and input samples), layer-wise statistics (means and standard deviations), and explicit reporting of sample sizes and number of layers evaluated for each model. revision: yes
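
The fixed random routing masks proposed in the second response can be sketched as follows. This is a hypothetical helper, not the authors' planned ablation: the name `random_topk_mask`, the uniform sampling of k experts per token, and the seeding scheme are assumptions. The point is only that the mask depends on a seed and not on the learned router or the token content.

```python
import random

def random_topk_mask(n_tokens, n_experts, k, seed=0):
    """Boolean mask with exactly k randomly chosen experts active per token.

    The choice is independent of any router logits, so geometry measured
    under this mask cannot be attributed to the learned routing decisions.
    """
    rng = random.Random(seed)
    mask = []
    for _ in range(n_tokens):
        chosen = set(rng.sample(range(n_experts), k))
        mask.append([e in chosen for e in range(n_experts)])
    return mask

# Hypothetical sizes: 6 tokens, 8 experts, 2 active experts per token.
mask = random_topk_mask(n_tokens=6, n_experts=8, k=2)
for row in mask:
    print("".join("1" if active else "0" for active in row))
```

Reusing the same seed reproduces the same mask, which keeps the intervention fixed across the soft and top-k conditions being compared.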

Circularity Check

0 steps flagged

No circularity: observational measurements via introduced framework

full rationale

The paper introduces a Jacobian-PCA-Grassmann framework as an analytical tool and applies it to measure functional decorrelation (via Jacobian alignment) and representational overlap (via PCA-Grassmann distances) in pretrained MoE models. These are direct empirical observations across models like Mistral and Qwen, with controlled routing experiments (top-k vs. soft) serving as interventions. No derivations reduce to fitted parameters by construction, no self-definitional loops, and no load-bearing self-citations or ansatz smuggling are present in the abstract or described chain. The results are self-contained empirical findings rather than predictions forced by the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; the framework relies on standard linear-algebraic and manifold assumptions whose details are not elaborated.

axioms (1)
  • domain assumption: Jacobian matrices and Grassmann distances on PCA subspaces faithfully capture functional and representational specialization.
    Invoked when defining the unified analysis framework for MoE layers.

pith-pipeline@v0.9.0 · 5707 in / 1262 out tokens · 42518 ms · 2026-05-20T23:42:06.568592+00:00 · methodology


