
arxiv: 2605.30992 · v1 · pith:KELUQXQWnew · submitted 2026-05-29 · 💻 cs.LG

Eigenvectors of Experts are Training-free Non-collapsing Routers

Pith reviewed 2026-06-28 23:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords Sparse Mixture of Experts · Expert Collapse · Training-free Routing · Singular Value Decomposition · Eigenvectors · SMoE · Router Design · Spectral Analysis

The pith

Eigenvectors of expert weight matrices encode semantic information that can be used as a training-free router to prevent collapse in SMoE models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in advanced Sparse Mixture of Experts models, the eigenvectors of the expert weight matrices already contain rich semantic details about tokens. This observation leads to SSMoE, a method that applies singular value decomposition directly to those weights to derive routing decisions. SSMoE requires no additional training or fine-tuning yet reduces the expert collapse problem that hurts both training and inference. Experiments on language and vision tasks, including corrupted data, indicate that the approach maintains or improves performance compared with trained routers. The work argues that internal spectral properties of pretrained experts offer a practical alternative to conventional routing strategies.

Core claim

In advanced SMoE models the eigenvectors of expert weight matrices encode rich semantic information; singular value decomposition of those matrices therefore supplies routing scores that avoid expert collapse without any training or fine-tuning, yielding a framework called SSMoE whose performance holds across language and vision tasks under both clean and corrupt conditions.
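
As a rough illustration of how SVD of expert weights could supply routing scores, the sketch below stacks each expert's leading right singular vectors into a routing matrix and scores a token by its squared projection onto them. This page does not state the paper's exact routing rule, so the function names, the `rank` and `top_k` parameters, and the squared-projection score are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def spectral_routing_matrix(expert_weights, rank=1):
    """Stack the top-`rank` right singular vectors of each expert weight.

    expert_weights: list of arrays, each of shape (d_out, d_model).
    Returns an array of shape (num_experts, rank, d_model).
    Hypothetical helper: the retained rank is an assumed hyperparameter.
    """
    rows = []
    for w in expert_weights:
        # Rows of vt are right singular vectors (eigenvectors of w.T @ w),
        # ordered by decreasing singular value.
        _, _, vt = np.linalg.svd(w, full_matrices=False)
        rows.append(vt[:rank])
    return np.stack(rows)

def route(tokens, routing_matrix, top_k=2):
    """Pick the top_k experts per token by energy in each spectral subspace.

    tokens: array of shape (num_tokens, d_model).
    Returns (expert indices, scores), each of shape (num_tokens, top_k).
    """
    # proj[t, e, r] = inner product of token t with vector r of expert e.
    proj = np.einsum("td,erd->ter", tokens, routing_matrix)
    # Squaring makes the score independent of the arbitrary sign of each vector.
    scores = (proj ** 2).sum(axis=-1)
    top = np.argsort(-scores, axis=1)[:, :top_k]
    return top, np.take_along_axis(scores, top, axis=1)
```

Because the routing matrix is computed once from frozen weights, a router of this kind has no trainable parameters, which is the property the core claim relies on.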

What carries the argument

Singular Value Decomposition applied to expert weight matrices, whose eigenvectors supply the routing decisions.
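
The text speaks of eigenvectors while the method applies SVD; the two are linked by a standard linear-algebra identity, written out below. The symbol W_e follows the referee report further down; the remaining notation (U_e, Σ_e, V_e, σ, u, v) is assumed here, and this page does not say whether the paper uses the left or the right singular vectors.

```latex
W_e = U_e \Sigma_e V_e^{\top}, \qquad
W_e^{\top} W_e \, v_{e,i} = \sigma_{e,i}^{2} \, v_{e,i}, \qquad
W_e W_e^{\top} \, u_{e,i} = \sigma_{e,i}^{2} \, u_{e,i}
```

That is, the right singular vectors of an expert weight matrix are eigenvectors of its Gram matrix on the input side, and the left singular vectors are eigenvectors of the Gram matrix on the output side.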

If this is right

  • SSMoE can be applied directly to existing pretrained SMoE models without further optimization.
  • Routing derived from spectral properties reduces collapse on both language and vision tasks.
  • The same eigenvectors remain effective under clean and corrupted input data.
  • No additional training data or compute is required to obtain the router.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The finding suggests that pretrained expert matrices already embed task-relevant structure that conventional learned routers must rediscover.
  • Similar spectral analysis could be tested on other modular architectures such as mixture-of-experts variants in vision transformers.
  • If eigenvectors prove stable, they might serve as a diagnostic tool to inspect what semantic distinctions each expert has learned.

Load-bearing premise

The semantic encoding observed in the eigenvectors of expert weight matrices is stable enough to produce reliable routing decisions across models, tasks, and data distributions without retraining.

What would settle it

Run SSMoE on a new pretrained SMoE model on a standard benchmark and measure whether expert usage still collapses to a small subset of experts.
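
One way to make "collapses to a small subset of experts" measurable is sketched below: count how often each expert is selected, then summarize the usage with a normalized entropy and the share of traffic taken by the busiest experts. The page does not specify the paper's collapse metric, so these functions and their names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def expert_usage(assignments, num_experts):
    """Fraction of routed tokens sent to each expert.

    assignments: iterable of expert indices, one per routed token.
    """
    assignments = list(assignments)
    if not assignments:
        raise ValueError("need at least one routed token")
    counts = Counter(assignments)
    return [counts.get(e, 0) / len(assignments) for e in range(num_experts)]

def normalized_entropy(usage):
    """1.0 for perfectly balanced usage, 0.0 when one expert gets everything."""
    if len(usage) < 2:
        raise ValueError("need at least two experts")
    h = -sum(p * math.log(p) for p in usage if p > 0)
    return h / math.log(len(usage))

def top_share(usage, m=1):
    """Share of tokens handled by the m most-used experts."""
    return sum(sorted(usage, reverse=True)[:m])
```

A low normalized entropy or a `top_share` close to 1 for small `m` would indicate collapse; comparing these numbers between the original router and SSMoE on the same inputs is the kind of check the paragraph above describes.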

Figures

Figures reproduced from arXiv: 2605.30992 by Giang Do, Hung Le, Truyen Tran.

Figure 1. Average router collapse levels across layers for ten state-of-the-art MoE-based LLMs. The results demonstrate that while all models exhibit router collapse, the intensity varies between reasoning and non-reasoning models. Best viewed in color. We also evaluate recent large-scale models such as GPT-OSS-20B, GPT-OSS-120B (OpenAI et al., 2025), ERNIE-4.5 (Baidu-ERNIE-Team, 2025), and the Qwen3-MoE series (Ins…
Figure 2. We report performance comparisons across benchmarks for different GPT-OSS model scales under the 5-shot evaluation setting. The proposed SSMoE consistently outperforms all baseline methods across the eight datasets, achieving an average improvement of approximately 13%. Notably, SSMoE surpasses the original GPT-OSS models with an average gain of about 6%, while simultaneously reducing memory consumption by…
Figure 3. An illustration of our Eigenvectors Representation, which leverages enriched information from the eigenvectors of expert weights. In contrast, SMoE employs a learnable expert embedding (router) to select the top-k experts for each token. SSMoE provides an efficient and robust representation, as demonstrated in Section 4. Best viewed in color. Step 3: Spectral Routing Matrix. Concatenate the expert spectral…
Figure 4. We compare the performance of the Eigenvector representation (EV) with traditional routers and OLMoE-7B (Muennighoff et al., 2025) on the Massive Text Embedding Benchmark (MTEB). The results show that EV outperforms OLMoE-7B on 37% of the tasks across six benchmarks while reducing computational cost by 5% in GFLOPs. These findings suggest that EV captures rich semantic information. Best viewed in color. O…
Figure 5. Average performance on MTEB tasks across three advanced LLMs (OLMoE-7B, Qwen-MoE-7B, and DeepSeekMoE-16B), comparing SSMoE, SMoE, Eigenvectors (EV), and MoEE using PromptEOL (Jiang et al., 2023) for in-context learning evaluation. Best viewed in color. In-context Learning Evaluation
Figure 6. … b, which is theoretically supported by Lemma B.1. Latent Structure Discovery
Figure 7. t-SNE visualizations of two learned representations, colored by KMeans cluster assignments with k = 64. SSMoE (left) displays more compact and well-separated clusters than SMoE (right), indicating stronger underlying cluster structure. Clustering quality is further supported by quantitative metrics: higher Silhouette and lower Davies-Bouldin index for SSMoE. 5. Conclusion In this research, we investigate…
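
The Figure 7 caption cites the Davies-Bouldin index as one of its clustering-quality metrics. For readers unfamiliar with it, here is a minimal NumPy sketch of the standard definition, where lower values mean tighter and better separated clusters. It takes cluster labels as given (for example from KMeans) and is an illustration only, not the evaluation code used in the paper.

```python
import numpy as np

def davies_bouldin(points, labels):
    """Davies-Bouldin index for a labelled point set (lower is better).

    points: array-like of shape (n_samples, n_features).
    labels: array-like of shape (n_samples,), one cluster id per point.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.stack([points[labels == c].mean(axis=0) for c in ids])
    # Scatter: mean distance from each cluster's points to its centroid.
    scatter = np.array([
        np.linalg.norm(points[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    # Pairwise centroid distances; the diagonal is excluded from the max.
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    ratio = (scatter[:, None] + scatter[None, :]) / dist
    # For each cluster take its worst (largest) ratio, then average.
    return ratio.max(axis=1).mean()
```

Computing this on the SSMoE and SMoE representations with the same cluster count would reproduce the kind of comparison the caption reports.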
Original abstract

Sparse Mixture of Experts (SMoE) architectures improve the training efficiency of Large Language Models (LLMs) by routing input tokens to a selected subset of specialized experts. Despite their remarkable success, both training and inference in SMoE models suffer from the expert collapse issue (Chi et al., 2022), which degrades model performance. Prior studies primarily focus on improving the router; however, such methods rely on training from scratch or fine-tuning, which requires high computational and data-processing costs. Furthermore, we demonstrate that, despite these efforts, the issue persists when advancing well-pretrained SMoE models, as evidenced by both theoretical and empirical results. To fill that gap, we analyze the advanced SMoE models and observe that the eigenvectors of expert weight matrices encode rich semantic information, pointing to an effective alternative to conventional routing strategies. Building on this insight, we propose Singular Value Decomposition SMoE (SSMoE), a novel and training-free framework that leverages spectral properties of the expert weights to address the collapse issue and enhance model performance. Extensive experiments across diverse language and vision tasks, under both clean and corrupt data settings, demonstrate the strong generalization and robustness of SSMoE. Our findings highlight how a deeper understanding of model internals can guide the development of more effective SMoE architectures. Our implementation is publicly available at https://github.com/giangdip2410/SSMoE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript observes that eigenvectors of expert weight matrices in pretrained SMoE models encode rich semantic information. It proposes SSMoE, a training-free router that applies SVD to these weights to generate routing decisions, claiming this mitigates expert collapse (which the authors assert persists even after prior router improvements) and improves performance. The approach is evaluated on language and vision tasks under clean and corrupted data, with public code released.

Significance. If the central observation and construction hold, the work supplies a low-cost, training-free alternative to learned routers in SMoE architectures, potentially lowering fine-tuning overhead while improving load balance and robustness. The public implementation is a clear strength that supports reproducibility.

major comments (3)
  1. [Abstract] The assertion that collapse 'persists when advancing well-pretrained SMoE models, as evidenced by both theoretical and empirical results' is load-bearing for the motivation, yet the text supplies neither equations, theorems, nor any experimental protocol or metric quantifying this persistence.
  2. The construction of the SSMoE router is presented as leveraging 'spectral properties' to produce non-collapsing routing logits, but no derivation or argument is given showing why the top eigenvectors of W_e align with token-expert affinity directions learned during pretraining or remain stable under distribution shift; the skeptic's concern that this may reduce to a static heuristic therefore cannot be assessed.
  3. The weakest assumption, that the observed semantic encoding is sufficiently general and invariant to expert specialization and data shifts, is stated but not tested with any controlled ablation (e.g., across different pretraining distributions or expert counts) that would be required to support the 'strong generalization' claim.
minor comments (1)
  1. [Abstract] The abstract is unusually dense and would benefit from a single sentence clarifying the exact routing rule (e.g., how many leading singular vectors are retained and how they are combined into logits).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below, clarifying the manuscript's claims and indicating where revisions will strengthen the presentation. Our responses focus on substance and aim to improve clarity without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract] The assertion that collapse 'persists when advancing well-pretrained SMoE models, as evidenced by both theoretical and empirical results' is load-bearing for the motivation, yet the text supplies neither equations, theorems, nor any experimental protocol or metric quantifying this persistence.

    Authors: We agree that the abstract statement would benefit from explicit pointers to the supporting material. Section 3 of the manuscript contains the theoretical analysis, including equations and a proof sketch showing that standard router improvements do not eliminate collapse in pretrained models. Section 4 presents the corresponding empirical protocol, using load-balance metrics and performance degradation under continued training. To make this immediately visible, we will revise the abstract to reference these sections and briefly note the key metrics employed.

    Revision: yes

  2. Referee: The construction of the SSMoE router is presented as leveraging 'spectral properties' to produce non-collapsing routing logits, but no derivation or argument is given showing why the top eigenvectors of W_e align with token-expert affinity directions learned during pretraining or remain stable under distribution shift; the skeptic's concern that this may reduce to a static heuristic therefore cannot be assessed.

    Authors: The manuscript presents the alignment as an empirical observation supported by visualization and downstream performance, rather than a formal derivation. We will add a short explanatory paragraph in Section 3 that connects the top eigenvectors to the principal directions of expert specialization observed during pretraining, drawing on the fact that expert weight matrices encode token-feature covariances. While a full stability proof under arbitrary shifts is beyond the current scope, we will include a brief discussion of why the leading singular vectors are expected to be more robust than learned routers. This addition will allow readers to evaluate the heuristic concern directly.

    Revision: partial

  3. Referee: The weakest assumption, that the observed semantic encoding is sufficiently general and invariant to expert specialization and data shifts, is stated but not tested with any controlled ablation (e.g., across different pretraining distributions or expert counts) that would be required to support the 'strong generalization' claim.

    Authors: We acknowledge that the generalization claim would be stronger with additional controlled ablations. The current experiments already span multiple model scales, modalities, and both clean and corrupted data, but they do not systematically vary pretraining corpora or expert counts. We will add a new ablation subsection that reports results for models pretrained on different data distributions and with varying numbers of experts, thereby directly testing the invariance assumption.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation leads to heuristic method without self-referential reduction

Full rationale

The paper's core contribution is an empirical observation that eigenvectors of expert weight matrices encode semantic information, followed by the proposal of a training-free SSMoE router based on SVD. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional equivalence. The abstract and described approach treat the spectral property as an observed fact used to motivate a new construction, with performance validated by experiments rather than derived by construction from the inputs. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5781 in / 1071 out tokens · 24508 ms · 2026-06-28T23:24:31.654343+00:00 · methodology


Reference graph

Works this paper leans on

11 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300/. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge,

  2. [2]

    Training Verifiers to Solve Math Word Problems

    URL https://arxiv.org/abs/1803.05457. Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Gordon, G., Dunson, D., and Dudík, M. (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. ...

  3. [3]

    Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., and Wei, F

    URL https://arxiv.org/abs/2504.05342. Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., and Wei, F. StableMoE: Stable routing strategy for mixture of experts. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7085–7095, Dubli...

  4. [4]

    doi: 10.18653/v1/2022.acl-long.489

    Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.489. URL https://aclanthology.org/2022.acl-long.489/. Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W...

  5. [5]

    Dutta, A., Krishnan, S., Kwatra, N., and Ramjee, R

    URL https://proceedings.mlr.press/v162/du22c.html. Dutta, A., Krishnan, S., Kwatra, N., and Ramjee, R. Accuracy is not all you need. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 124347–124390. Curran Associates, Inc., 2024. doi: 10....

  6. [6]

    Jiang, T., Huang, S., Luan, Z., Wang, D., and Zhuang, F

    URL https://arxiv.org/abs/2307.16645. Jiang, T., Huang, S., Luan, Z., Wang, D., and Zhuang, F. Scaling sentence embeddings with large language models. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 3182–3196, Miami, Florida, USA, November 2024. Association for Computat...

  7. [7]

    URL https://doi.org/10.1109/TPAMI.2025.3532688

    doi: 10.1109/TPAMI.2025.3532688. URL https://doi.org/10.1109/TPAMI.2025.3532688. Li, Z. and Zhou, T. Your mixture-of-experts LLM is secretly an embedding model for free. In The Thirteenth International Conference on Learning Representations,

  8. [8]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    URL https://openreview.net/forum?id=eFGQ97z5Cd. Li, Z., Liang, C., Zhang, Z., Hong, I., Kim, Y. J., Chen, W., and Zhao, T. Slimmoe: Structured compression of large moe models via expert slimming and distillation. In Second Conference on Language Modeling, 2025b. URL https://openreview.net/forum?id=oaCUsn391F. Lin, T.-Y., Maire, M., Belongie, S., Hays,...

  9. [9]

    gpt-oss-120b & gpt-oss-20b Model Card

    URL https://openreview.net/forum?id=Pu3c0209cx. OpenAI, :, Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R. K., Bai, Y., Baker, B., Bao, H., Barak, B., Bennett, A., Bertao, T., Brett, N., Brevdo, E., Brockman, G., Bubeck, S., Chang, C., Chen, K., Chen, M., Cheung, E., Clark, A., Cook, D., Dukhan, M., Dvorak, C., Fives, K...

  10. [10]

    In: 2015 IEEE International Conference on Computer Vision (ICCV)

    doi: 10.1109/ICCV.2015.303. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, vol...

  11. [11]

    Eigenvectors of Experts are Training-free Non-collapsing Routers

    URL https://openreview.net/forum?id=B1ckMDqlg. Shen, L., Chen, G., Shao, R., Guan, W., and Nie, L. Mome: Mixture of multimodal experts for generalist multimodal large language models. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 420...