Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap
Pith reviewed 2026-05-20 23:42 UTC · model grok-4.3
The pith
MoE experts in pretrained Transformers show near-zero functional correlation but only partial overlap in their representation subspaces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across pretrained MoE Transformers, experts exhibit strong functional decorrelation with near-zero cross-expert Jacobian alignment while their routed representations occupy distinct but partially overlapping subspaces. Functional decorrelation and representational overlap therefore coexist rather than coincide. Controlled routing experiments show that top-k routing produces sharper functional separation and larger subspace divergence, whereas fully soft routing yields more entangled expert structure. The results support viewing MoE layers as locally decorrelated operators acting over overlapping submanifolds on a shared representation manifold.
What carries the argument
The Jacobian-PCA-Grassmann framework, which quantifies functional decorrelation through cross-expert Jacobian alignment and representational overlap through subspace distances on the Grassmann manifold.
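The representation-space half of this framework can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes each expert's subspace is the top-r PCA subspace of its routed activations and that the subspace distance is the standard Grassmann geodesic distance (the 2-norm of the principal angles). The names `pca_basis`, `principal_angles`, and `grassmann_geodesic`, the rank `r`, and the synthetic activations are all hypothetical.

```python
import numpy as np

def pca_basis(X, r):
    """Orthonormal basis (d x r) of the top-r principal subspace of the rows of X."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # Right singular vectors of the centered data are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:r].T

def principal_angles(U, V):
    """Principal angles (radians) between the column spaces of orthonormal U and V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def grassmann_geodesic(U, V):
    """Geodesic distance on the Grassmann manifold: 2-norm of the principal angles."""
    return float(np.linalg.norm(principal_angles(U, V)))

rng = np.random.default_rng(0)
d, r, n = 16, 3, 200

# Synthetic stand-ins for routed activations (assumption): two experts whose
# r-dimensional subspaces share exactly one direction, i.e. partial overlap.
shared = rng.normal(size=(d, 1))
A = np.hstack([shared, rng.normal(size=(d, r - 1))])
B = np.hstack([shared, rng.normal(size=(d, r - 1))])
X_i = rng.normal(size=(n, r)) @ A.T
X_j = rng.normal(size=(n, r)) @ B.T

U_i, U_j = pca_basis(X_i, r), pca_basis(X_j, r)
d_same = grassmann_geodesic(U_i, U_i)    # identical subspaces: distance ~ 0
d_cross = grassmann_geodesic(U_i, U_j)   # partially overlapping subspaces
d_max = np.sqrt(r) * np.pi / 2           # all principal angles equal to pi/2
```

In this toy setup the smallest principal angle is close to zero because of the shared direction, while the remaining angles are generic, so `d_cross` falls strictly between 0 and `d_max`. That is the kind of "distinct but partially overlapping" reading the summary describes.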
If this is right
- Top-k routing sharpens functional separation and increases subspace divergence between experts.
- Fully soft routing produces more entangled expert structure in both function and representation space.
- MoE layers implement locally decorrelated operators over overlapping submanifolds on a shared representation manifold.
- Routing sparsity is a primary driver of the observed geometric asymmetry in expert specialization.
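The two routing regimes contrasted above can be made concrete with a small sketch. It assumes one common convention for top-k routing (keep the k largest router logits per token and renormalise them with a softmax); the paper's exact routing variants are not specified here, and the function names and example logits are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_routing(logits):
    """Fully soft routing: every expert receives a nonzero weight."""
    return softmax(logits)

def topk_routing(logits, k):
    """Top-k routing: keep the k largest logits per token, softmax over those only."""
    idx = np.argsort(logits, axis=-1)[..., -k:]
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, idx, np.take_along_axis(logits, idx, axis=-1), axis=-1)
    return softmax(masked)  # exp(-inf) = 0, so unselected experts get weight 0

# Hypothetical router logits: 2 tokens, 4 experts.
logits = np.array([[2.0, 1.0, 0.5, -1.0],
                   [0.1, 0.3, 3.0, 2.5]])
w_soft = soft_routing(logits)
w_top2 = topk_routing(logits, k=2)
```

Under soft routing every expert contributes to every token, so all experts see gradients and activations from the same inputs. Under top-k routing each token touches only k experts, which is the sparsity the bullets above identify as the driver of sharper separation.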
Where Pith is reading between the lines
- Designers could deliberately adjust routing temperature or k to tune the desired balance between functional independence and representational sharing.
- The partial overlap finding suggests that expert merging or pruning algorithms might safely combine experts whose subspaces are highly aligned without large performance loss.
- The same measurement pipeline could be applied to study specialization in other conditional-computation architectures beyond standard MoE Transformers.
Load-bearing premise
The Jacobian-PCA-Grassmann measurements give a faithful and complete picture of expert specialization without needing confirmation from other metrics or causal interventions.
What would settle it
Finding high cross-expert Jacobian alignment or completely non-overlapping subspaces in additional pretrained MoE models would contradict the reported asymmetry.
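To make the falsification test concrete, here is a hedged sketch of how cross-expert Jacobian alignment might be measured. The paper's exact definition is not given in this summary; the sketch assumes alignment is the cosine similarity of two experts' input-output Jacobians under the Frobenius inner product, averaged over sampled inputs. The toy experts `f(x) = W2 @ tanh(W1 @ x)` with random weights are hypothetical stand-ins, not Mistral or Qwen experts.

```python
import numpy as np

def expert_jacobian(W1, W2, x):
    """Input-output Jacobian of the toy expert f(x) = W2 @ tanh(W1 @ x)."""
    h = np.tanh(W1 @ x)
    # d tanh(u)/du = 1 - tanh(u)^2, applied row-wise to W1.
    return W2 @ ((1.0 - h**2)[:, None] * W1)

def jacobian_alignment(J_a, J_b):
    """Cosine similarity of two Jacobians under the Frobenius inner product (assumed metric)."""
    return float(np.sum(J_a * J_b) / (np.linalg.norm(J_a) * np.linalg.norm(J_b)))

rng = np.random.default_rng(0)
d, hdim, n_points = 32, 64, 50

# Two independently initialised toy experts (hypothetical weights).
experts = [(rng.normal(size=(hdim, d)) / np.sqrt(d),
            rng.normal(size=(d, hdim)) / np.sqrt(hdim)) for _ in range(2)]
xs = rng.normal(size=(n_points, d))

J0 = [expert_jacobian(*experts[0], x) for x in xs]
J1 = [expert_jacobian(*experts[1], x) for x in xs]
cross_alignment = float(np.mean([jacobian_alignment(a, b) for a, b in zip(J0, J1)]))
self_alignment = float(np.mean([jacobian_alignment(a, a) for a in J0]))
```

Independently drawn random experts already give a cross-expert alignment near zero in this sketch. A near-zero value therefore says little by itself and needs to be read against a chance reference.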
Original abstract
Mixture-of-Experts (MoE) architectures achieve scalable capacity through sparse routing, yet the geometric structure of expert specialization remains poorly understood. We introduce a unified Jacobian-PCA-Grassmann framework for analyzing MoE layers in both function space and representation space. Across pretrained MoE Transformers (Mistral, Qwen), we find a consistent structural asymmetry: experts exhibit strong functional decorrelation (consistently low, near-zero cross-expert Jacobian alignment) while their routed representations occupy distinct but partially overlapping subspaces. This indicates that functional decorrelation and representation overlap coexist rather than coincide in MoE specialization. Controlled routing experiments further indicate that routing sparsity appears to be a key factor shaping this geometry: top-k routing induces sharper functional separation and larger subspace divergence, whereas fully soft routing yields more entangled expert structure. Together, these results suggest a geometric interpretation in which MoE layers may be viewed as implementing locally decorrelated operators over overlapping submanifolds on a shared representation manifold, and provide a general diagnostic framework for studying conditional computation in modern Transformer architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a Jacobian-PCA-Grassmann framework for analyzing MoE layers in both function space (via cross-expert Jacobian alignment) and representation space (via PCA subspaces and Grassmann distances). Across pretrained models (Mistral, Qwen), it reports a consistent asymmetry: strong functional decorrelation (near-zero cross-expert Jacobian alignment) coexisting with partial overlap among the routed representation subspaces. Controlled experiments compare top-k and fully soft routing to argue that sparsity drives sharper separation, which leads to the interpretation of MoE layers as locally decorrelated operators over overlapping submanifolds.
Significance. If the framework and measurements prove robust, the work supplies a concrete geometric diagnostic for conditional computation in Transformers and highlights a non-obvious dissociation between functional and representational specialization. The use of real pretrained checkpoints rather than toy models is a positive feature; the controlled routing ablations, if cleanly isolated, could inform architecture choices. The absence of parameter fitting or self-referential definitions in the reported measurements is also a strength.
major comments (3)
- [Framework definition and §4 (experimental setup)] The central claim that functional decorrelation coexists with representational overlap rests on the Jacobian-PCA-Grassmann pipeline faithfully capturing both spaces. The manuscript does not report validation of Jacobian alignment against global function metrics (e.g., output correlation on held-out inputs) or alternative specialization measures, leaving open the possibility that the reported near-zero alignment reflects only local linear behavior at sampled points rather than the full expert mapping.
- [Controlled routing experiments] In the controlled routing experiments, routed representations are extracted conditionally on the same routing decisions used to define the subspaces. This introduces a potential circularity that the top-k versus soft comparison does not automatically resolve; an independent intervention (e.g., fixed random routing masks or post-hoc subspace projection) would be needed to establish sparsity as the causal driver.
- [Results and figures] The abstract and results claim 'consistent' low Jacobian alignment and 'partial overlap' across Mistral and Qwen, yet no error bars, layer-wise statistics, or sample-size details are referenced in the provided description. Without these, the strength of the cross-model generalization cannot be assessed.
minor comments (2)
- [Notation and methods] Clarify the precise sampling strategy for Jacobian estimation (number of points, input distribution) and the exact Grassmann distance formula employed.
- [Figures] Add random or shuffled-expert baselines to the Jacobian-alignment and subspace-overlap plots so that 'near-zero' and 'partial overlap' can be interpreted relative to chance.
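One way to obtain the chance reference requested in the last minor comment is to compare the measured subspace distances against distances between uniformly random subspaces of the same dimension. The sketch below is an assumption about how such a baseline could be built, not something the paper reports; the ambient dimension `d=64`, subspace rank `r=4`, and trial count are hypothetical.

```python
import numpy as np

def random_subspace(d, r, rng):
    """Orthonormal basis of a random r-dimensional subspace of R^d (QR of a Gaussian matrix)."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, r)))
    return Q

def grassmann_geodesic(U, V):
    """2-norm of the principal angles between the column spaces of U and V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.linalg.norm(np.arccos(np.clip(s, -1.0, 1.0))))

def chance_baseline(d, r, n_trials=200, seed=0):
    """Mean and sample standard deviation of the distance between random subspace pairs."""
    rng = np.random.default_rng(seed)
    dists = np.array([grassmann_geodesic(random_subspace(d, r, rng),
                                         random_subspace(d, r, rng))
                      for _ in range(n_trials)])
    return float(dists.mean()), float(dists.std(ddof=1))

mean_d, std_d = chance_baseline(d=64, r=4)
d_max = np.sqrt(4) * np.pi / 2  # distance when all four principal angles are pi/2
```

An expert pair whose distance sits well below `mean_d` overlaps more than chance would predict, which is the comparison needed to give "partial overlap" a quantitative meaning. A shuffled-expert baseline (permuting token-to-expert assignments before fitting the PCA subspaces) would be a complementary, data-dependent reference.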
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below, clarifying our methodological choices where appropriate and outlining planned revisions to improve clarity and robustness.
- Referee: The central claim that functional decorrelation coexists with representational overlap rests on the Jacobian-PCA-Grassmann pipeline faithfully capturing both spaces. The manuscript does not report validation of Jacobian alignment against global function metrics (e.g., output correlation on held-out inputs) or alternative specialization measures, leaving open the possibility that the reported near-zero alignment reflects only local linear behavior at sampled points rather than the full expert mapping.
  Authors: We agree that explicit validation against global metrics would strengthen the interpretation of the Jacobian results. While the Jacobian alignment is chosen to probe local linear behavior around activation points (relevant for sparse expert routing), we will add a new subsection in the revised manuscript comparing cross-expert Jacobian alignment to direct output correlations on held-out inputs, as well as to an alternative measure based on expert output divergence. This will help confirm that the observed near-zero alignment generalizes beyond the local linear regime.
  Revision: yes
- Referee: In the controlled routing experiments, routed representations are extracted conditionally on the same routing decisions used to define the subspaces. This introduces a potential circularity that the top-k versus soft comparison does not automatically resolve; an independent intervention (e.g., fixed random routing masks or post-hoc subspace projection) would be needed to establish sparsity as the causal driver.
  Authors: We appreciate the concern about potential circularity. The top-k versus soft comparison holds the model weights fixed while varying only the routing mechanism, allowing us to attribute geometric differences to sparsity level. Nevertheless, to more rigorously isolate causality, we will add an ablation using fixed random routing masks (independent of the learned router) and report the resulting subspace and Jacobian metrics. This will be included as an additional controlled experiment in the revised version.
  Revision: yes
- Referee: The abstract and results claim 'consistent' low Jacobian alignment and 'partial overlap' across Mistral and Qwen, yet no error bars, layer-wise statistics, or sample-size details are referenced in the provided description. Without these, the strength of the cross-model generalization cannot be assessed.
  Authors: We agree that quantitative details on variability are necessary to support claims of consistency. In the revised manuscript we will augment the results section and figures with error bars (standard error across layers and input samples), layer-wise statistics (means and standard deviations), and explicit reporting of sample sizes and number of layers evaluated for each model.
  Revision: yes
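The fixed random routing masks mentioned in the second response could look like the sketch below. It is an illustration under stated assumptions, not the authors' planned ablation: each token is assigned exactly k experts uniformly at random from a fixed seed, independent of any learned router, and the selected experts' outputs are averaged with equal weight. The function names, shapes, and the uniform averaging are all hypothetical.

```python
import numpy as np

def random_topk_mask(n_tokens, n_experts, k, seed=0):
    """Boolean mask (n_tokens x n_experts) with exactly k experts chosen uniformly per token."""
    rng = np.random.default_rng(seed)
    # Sorting i.i.d. uniform noise yields an independent random permutation per row.
    order = np.argsort(rng.random((n_tokens, n_experts)), axis=-1)
    mask = np.zeros((n_tokens, n_experts), dtype=bool)
    np.put_along_axis(mask, order[:, :k], True, axis=-1)
    return mask

def apply_mask(expert_outputs, mask):
    """Equal-weight average of the selected experts; expert_outputs is (n_tokens, n_experts, d)."""
    w = mask / mask.sum(axis=-1, keepdims=True)
    return np.einsum("te,ted->td", w, expert_outputs)

n_tokens, n_experts, k, d = 5, 4, 2, 3
mask = random_topk_mask(n_tokens, n_experts, k, seed=0)
mask_again = random_topk_mask(n_tokens, n_experts, k, seed=0)  # same seed, same mask

# Hypothetical per-expert outputs for each token.
expert_outputs = np.random.default_rng(1).normal(size=(n_tokens, n_experts, d))
routed = apply_mask(expert_outputs, mask)
```

Because the mask depends only on the seed, the token sets used to fit each expert's subspace no longer depend on the learned router, which is what breaks the circularity the referee describes.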
Circularity Check
No circularity: observational measurements via introduced framework
The paper introduces a Jacobian-PCA-Grassmann framework as an analytical tool and applies it to measure functional decorrelation (via Jacobian alignment) and representational overlap (via PCA-Grassmann distances) in pretrained MoE models. These are direct empirical observations across models like Mistral and Qwen, with controlled routing experiments (top-k vs. soft) serving as interventions. No derivations reduce to fitted parameters by construction, no self-definitional loops, and no load-bearing self-citations or ansatz smuggling are present in the abstract or described chain. The results are self-contained empirical findings rather than predictions forced by the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Jacobian matrices and Grassmann distances on PCA subspaces faithfully capture functional and representational specialization.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We introduce a unified Jacobian-PCA-Grassmann framework... cross-expert Jacobian alignment... Grassmannian Geodesic Distance dG(ei, ej)"
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "routing sparsity appears to be a key factor shaping this geometry"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.