Eigenvectors of Experts are Training-free Non-collapsing Routers

Giang Do; Hung Le; Truyen Tran

arxiv: 2605.30992 · v1 · pith:KELUQXQWnew · submitted 2026-05-29 · 💻 cs.LG


Pith reviewed 2026-06-28 23:24 UTC · model grok-4.3

classification 💻 cs.LG

keywords Sparse Mixture of Experts · Expert Collapse · Training-free Routing · Singular Value Decomposition · Eigenvectors · SMoE · Router Design · Spectral Analysis


The pith

Eigenvectors of expert weight matrices encode semantic information that can be used as a training-free router to prevent collapse in SMoE models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in advanced Sparse Mixture of Experts models, the eigenvectors of the expert weight matrices already contain rich semantic details about tokens. This observation leads to SSMoE, a method that applies singular value decomposition directly to those weights to derive routing decisions. SSMoE requires no additional training or fine-tuning yet reduces the expert collapse problem that hurts both training and inference. Experiments on language and vision tasks, including corrupted data, indicate that the approach maintains or improves performance compared with trained routers. The work argues that internal spectral properties of pretrained experts offer a practical alternative to conventional routing strategies.

Core claim

In advanced SMoE models the eigenvectors of expert weight matrices encode rich semantic information; singular value decomposition of those matrices therefore supplies routing scores that avoid expert collapse without any training or fine-tuning, yielding a framework called SSMoE whose performance holds across language and vision tasks under both clean and corrupt conditions.

What carries the argument

Singular Value Decomposition applied to expert weight matrices, whose eigenvectors supply the routing decisions.
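The idea can be illustrated with a minimal NumPy sketch. It is not the paper's routing rule, which this page does not spell out. The function names `spectral_routing_matrix` and `route`, the use of input-side (right) singular vectors, the rank parameter `r`, and the projection-norm score are all assumptions made for illustration; the authors' released implementation is the authoritative reference.

```python
import numpy as np

def spectral_routing_matrix(expert_weights, r=1):
    """Stack the top-r input-side singular vectors of each expert.

    expert_weights: list of arrays of shape (d_ff, d_model), so that an
    expert computes W @ x (an assumed layout).
    Returns an array of shape (num_experts, r, d_model).
    """
    bases = []
    for W in expert_weights:
        # Rows of Vt are right singular vectors (directions in input space),
        # ordered by decreasing singular value.
        _, _, Vt = np.linalg.svd(W, full_matrices=False)
        bases.append(Vt[:r])
    return np.stack(bases)

def route(x, routing_matrix, k=2):
    """Pick, per token, the k experts whose subspace captures most of x.

    x: array of shape (num_tokens, d_model).
    Returns (top-k expert indices, full score matrix).
    """
    # Coordinates of each token in each expert's basis: (tokens, experts, r).
    proj = np.einsum("td,erd->ter", x, routing_matrix)
    # Hypothetical score: length of the projection onto the expert subspace.
    scores = np.linalg.norm(proj, axis=-1)
    topk = np.argsort(-scores, axis=-1)[:, :k]
    return topk, scores
```

Under these assumptions the routing matrix is computed once from the frozen expert weights and involves no gradient steps, which is what "training-free" refers to here.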

If this is right

SSMoE can be applied directly to existing pretrained SMoE models without further optimization.
Routing derived from spectral properties reduces collapse on both language and vision tasks.
The same eigenvectors remain effective under clean and corrupted input data.
No additional training data or compute is required to obtain the router.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finding suggests that pretrained expert matrices already embed task-relevant structure that conventional learned routers must rediscover.
Similar spectral analysis could be tested on other modular architectures such as mixture-of-experts variants in vision transformers.
If eigenvectors prove stable, they might serve as a diagnostic tool to inspect what semantic distinctions each expert has learned.

Load-bearing premise

The semantic encoding observed in the eigenvectors of expert weight matrices is stable enough to produce reliable routing decisions across models, tasks, and data distributions without retraining.

What would settle it

Run SSMoE on a new pretrained SMoE model on a standard benchmark and measure whether expert usage still collapses to a small subset of experts.
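One way to quantify "collapses to a small subset of experts" is sketched below. The metrics (per-expert usage fractions, normalized entropy, and a count of experts above a usage threshold) are common load-balance diagnostics chosen here as an assumption; the page does not state which collapse metric the paper uses, and the function names and the 1% threshold are hypothetical.

```python
import math
from collections import Counter

def expert_usage(assignments, num_experts):
    """Fraction of routed tokens received by each expert."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def normalized_entropy(fractions):
    """Entropy of the usage distribution divided by log(num_experts).

    1.0 means perfectly balanced usage; 0.0 means every token goes to a
    single expert. Requires at least two experts.
    """
    h = -sum(p * math.log(p) for p in fractions if p > 0)
    return h / math.log(len(fractions))

def active_experts(fractions, threshold=0.01):
    """Number of experts receiving at least `threshold` of the tokens."""
    return sum(1 for p in fractions if p >= threshold)
```

Here `assignments` is assumed to be a flat list of expert indices, one per routed token, collected while running the benchmark. A normalized entropy near 0, or an active-expert count far below the total, would indicate that usage has collapsed.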

Figures

Figures reproduced from arXiv: 2605.30992 by Giang Do, Hung Le, Truyen Tran.

**Figure 1.** Average router collapse levels across layers for ten state-of-the-art MoE-based LLMs. The results demonstrate that while all models exhibit router collapse, the intensity varies between reasoning and non-reasoning models. Best viewed in color. We also evaluate recent large-scale models such as GPT-OSS-20B, GPT-OSS-120B (OpenAI et al., 2025), ERNIE4.5 (Baidu-ERNIE-Team, 2025), and the Qwen3-MoE series (Ins…

**Figure 2.** We report performance comparisons across benchmarks for different GPT-OSS model scales under the 5-shot evaluation setting. The proposed SSMoE consistently outperforms all baseline methods across the eight datasets, achieving an average improvement of approximately 13%. Notably, SSMoE surpasses the original GPT-OSS models with an average gain of about 6%, while simultaneously reducing memory consumption by…

**Figure 3.** An illustration of our Eigenvectors Representation, which leverages enriched information from the eigenvectors of expert weights. In contrast, SMoE employs a learnable expert embedding (router) to select the top-k experts for each token. SSMoE provides an efficient and robust representation, as demonstrated in Section 4. Best viewed in color. Step 3: Spectral Routing Matrix. Concatenate the expert spectral…

**Figure 4.** We compare the performance of the Eigenvector representation (EV) with traditional routers and OLMoE-7B (Muennighoff et al., 2025) on the Massive Text Embedding Benchmark (MTEB). The results show that EV outperforms OLMoE-7B on 37% of the tasks across six benchmarks while reducing computational cost by 5% in GFLOPs. These findings suggest that EV captures rich semantic information. Best viewed in color.

**Figure 5.** Average performance on MTEB tasks across three advanced LLMs (OLMoE-7B, Qwen-MoE-7B, and DeepSeekMoE-16B), comparing SSMoE, SMoE, Eigenvectors (EV), and MoEE using PromptEOL (Jiang et al., 2023) for in-context learning evaluation. Best viewed in color.

**Figure 6.** … b, which is theoretically supported by Lemma B.1.

**Figure 7.** t-SNE visualizations of two learned representations, colored by KMeans cluster assignments with k = 64. SSMoE (left) displays more compact and well-separated clusters than SMoE (right), indicating stronger underlying cluster structure. Clustering quality is further supported by quantitative metrics: a higher Silhouette score and a lower Davies-Bouldin index for SSMoE.

read the original abstract

Sparse Mixture of Experts (SMoE) architectures improve the training efficiency of Large Language Models (LLMs) by routing input tokens to a selected subset of specialized experts. Despite their remarkable success, both training and inference in SMoE models suffer from the expert collapse issue (Chi et al., 2022), which degrades model performance. Prior studies primarily focus on improving the router; however, such methods rely on training from scratch or fine-tuning, which requires high computational and data-processing costs. Furthermore, we demonstrate that, despite these efforts, the issue persists when advancing well-pretrained SMoE models, as evidenced by both theoretical and empirical results. To fill that gap, we analyze the advanced SMoE models and observe that the eigenvectors of expert weight matrices encode rich semantic information, pointing to an effective alternative to conventional routing strategies. Building on this insight, we propose Singular Value Decomposition SMoE (SSMoE), a novel and training-free framework that leverages spectral properties of the expert weights to address the collapse issue and enhance model performance. Extensive experiments across diverse language and vision tasks, under both clean and corrupt data settings, demonstrate the strong generalization and robustness of SSMoE. Our findings highlight how a deeper understanding of model internals can guide the development of more effective SMoE architectures. Our implementation is publicly available at https://github.com/giangdip2410/SSMoE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SSMoE uses SVD on expert weights for a training-free router that claims to fix collapse, but the abstract leaves the construction and stability claims hard to verify.


The main takeaway is that this paper observes eigenvectors of expert weight matrices in pretrained SMoE models appear to carry semantic structure, then builds a training-free router from their SVD to route tokens without collapse. The method is called SSMoE and is tested on language and vision tasks plus corrupted inputs.

What the work does is shift the focus from training better routers to extracting routing signals directly from the frozen expert matrices. Experiments cover multiple tasks and data conditions, and the code is released, which lets others check whether the reported gains hold up. That is concrete and useful for the subfield.

The soft spot is the gap between the observation and the router construction. The abstract states that the eigenvectors encode rich semantics but does not show the exact mapping to routing logits, nor any argument that these directions remain aligned with token-expert affinities after distribution shift. If the top eigenvectors mostly reflect weight norms or low-level statistics rather than the learned specialization, the fixed router could still under-use experts on new data. The stress-test concern about invariance therefore stands until the full derivation and ablations are examined.

This is for researchers already working on MoE scaling who want low-overhead alternatives to router fine-tuning. It deserves a serious referee because the angle is distinct from the training-based fixes in the cited literature and the empirical scope is reasonably broad, even if the theoretical link needs tightening.

Referee Report

3 major / 1 minor

Summary. The manuscript observes that eigenvectors of expert weight matrices in pretrained SMoE models encode rich semantic information. It proposes SSMoE, a training-free router that applies SVD to these weights to generate routing decisions, claiming this mitigates expert collapse (which the authors assert persists even after prior router improvements) and improves performance. The approach is evaluated on language and vision tasks under clean and corrupted data, with public code released.

Significance. If the central observation and construction hold, the work supplies a low-cost, training-free alternative to learned routers in SMoE architectures, potentially lowering fine-tuning overhead while improving load balance and robustness. The public implementation is a clear strength that supports reproducibility.

major comments (3)

[Abstract] Abstract: the assertion that collapse 'persists when advancing well-pretrained SMoE models, as evidenced by both theoretical and empirical results' is load-bearing for the motivation, yet the text supplies neither equations, theorems, nor any experimental protocol or metric quantifying this persistence.
The construction of the SSMoE router is presented as leveraging 'spectral properties' to produce non-collapsing routing logits, but no derivation or argument is given showing why the top eigenvectors of W_e align with token-expert affinity directions learned during pretraining or remain stable under distribution shift; the skeptic concern that this may reduce to a static heuristic therefore cannot be assessed.
The weakest assumption—that the observed semantic encoding is sufficiently general and invariant to expert specialization and data shifts—is stated but not tested with any controlled ablation (e.g., across different pretraining distributions or expert counts) that would be required to support the 'strong generalization' claim.

minor comments (1)

[Abstract] The abstract is unusually dense and would benefit from a single sentence clarifying the exact routing rule (e.g., how many leading singular vectors are retained and how they are combined into logits).
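For concreteness, one form such a routing rule could take is written out below. This is a hypothetical rule, not the one used in the paper: the rank $r$, the expert count $E$, the top-$k$ selection $\mathcal{T}$, and the projection-norm score $s_e$ are symbols introduced here for illustration, and $W_e$ is assumed to act on a token $x$ as $W_e x$.

```latex
W_e = U_e \Sigma_e V_e^{\top}, \qquad
s_e(x) = \bigl\lVert V_{e,1:r}^{\top} x \bigr\rVert_2, \qquad
\mathcal{T}(x) = \operatorname{TopK}_{e \in \{1,\dots,E\}} \, s_e(x)
```

Here $V_{e,1:r}$ denotes the $r$ leading right singular vectors of expert $e$. A clarifying sentence in the abstract would need to pin down exactly these choices: the value of $r$, the side of the decomposition used, and how the projections are turned into logits.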

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below, clarifying the manuscript's claims and indicating where revisions will strengthen the presentation. Our responses focus on substance and aim to improve clarity without altering the core contributions.


Referee: [Abstract] Abstract: the assertion that collapse 'persists when advancing well-pretrained SMoE models, as evidenced by both theoretical and empirical results' is load-bearing for the motivation, yet the text supplies neither equations, theorems, nor any experimental protocol or metric quantifying this persistence.

Authors: We agree that the abstract statement would benefit from explicit pointers to the supporting material. Section 3 of the manuscript contains the theoretical analysis, including equations and a proof sketch showing that standard router improvements do not eliminate collapse in pretrained models. Section 4 presents the corresponding empirical protocol, using load-balance metrics and performance degradation under continued training. To make this immediately visible, we will revise the abstract to reference these sections and briefly note the key metrics employed.

revision: yes

Referee: [—] The construction of the SSMoE router is presented as leveraging 'spectral properties' to produce non-collapsing routing logits, but no derivation or argument is given showing why the top eigenvectors of W_e align with token-expert affinity directions learned during pretraining or remain stable under distribution shift; the skeptic concern that this may reduce to a static heuristic therefore cannot be assessed.

Authors: The manuscript presents the alignment as an empirical observation supported by visualization and downstream performance, rather than a formal derivation. We will add a short explanatory paragraph in Section 3 that connects the top eigenvectors to the principal directions of expert specialization observed during pretraining, drawing on the fact that expert weight matrices encode token-feature covariances. While a full stability proof under arbitrary shifts is beyond the current scope, we will include a brief discussion of why the leading singular vectors are expected to be more robust than learned routers. This addition will allow readers to evaluate the heuristic concern directly.

revision: partial

Referee: [—] The weakest assumption—that the observed semantic encoding is sufficiently general and invariant to expert specialization and data shifts—is stated but not tested with any controlled ablation (e.g., across different pretraining distributions or expert counts) that would be required to support the 'strong generalization' claim.

Authors: We acknowledge that the generalization claim would be stronger with additional controlled ablations. The current experiments already span multiple model scales, modalities, and both clean and corrupted data, but they do not systematically vary pretraining corpora or expert counts. We will add a new ablation subsection that reports results for models pretrained on different data distributions and with varying numbers of experts, thereby directly testing the invariance assumption.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation leads to heuristic method without self-referential reduction

full rationale

The paper's core contribution is an empirical observation that eigenvectors of expert weight matrices encode semantic information, followed by the proposal of a training-free SSMoE router based on SVD. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional equivalence. The abstract and described approach treat the spectral property as an observed fact used to motivate a new construction, with performance validated by experiments rather than derived by construction from the inputs. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5781 in / 1071 out tokens · 24508 ms · 2026-06-28T23:24:31.654343+00:00 · methodology


Reference graph

Works this paper leans on

11 extracted references · 10 canonical work pages · 2 internal anchors

[1]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300/. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge,

work page doi:10.18653/v1/n19-1300
[2]

Training Verifiers to Solve Math Word Problems

URL https://arxiv.org/abs/1803.05457. Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Gordon, G., Dunson, D., and Dudík, M. (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. ...

work page internal anchor Pith review Pith/arXiv arXiv 2011
[3]

Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., and Wei, F

URL https://arxiv.org/abs/2504.05342. Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., and Wei, F. StableMoE: Stable routing strategy for mixture of experts. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7085–7095, Dubli...
[4]

doi: 10.18653/v1/2022.acl-long.489

Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.489. URL https://aclanthology.org/2022.acl-long.489/. Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W...

work page doi:10.18653/v1/2022.acl-long.489 2022
[5]

Dutta, A., Krishnan, S., Kwatra, N., and Ramjee, R

URL https://proceedings.mlr.press/v162/du22c.html. Dutta, A., Krishnan, S., Kwatra, N., and Ramjee, R. Accuracy is not all you need. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 124347–124390. Curran Associates, Inc., 2024. doi: 10....

work page doi:10.52202/079017-3950 2024
[6]

Jiang, T., Huang, S., Luan, Z., Wang, D., and Zhuang, F

URL https://arxiv.org/abs/2307.16645. Jiang, T., Huang, S., Luan, Z., Wang, D., and Zhuang, F. Scaling sentence embeddings with large language models. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 3182–3196, Miami, Florida, USA, November 2024. Association for Computat...

work page doi:10.18653/v1/2024.findings-emnlp.181 2024
[7]

URL https://doi.org/10.1109/TPAMI.2025.3532688

doi: 10.1109/TPAMI.2025.3532688. URL https://doi.org/10.1109/TPAMI.2025.3532688. Li, Z. and Zhou, T. Your mixture-of-experts LLM is secretly an embedding model for free. In The Thirteenth International Conference on Learning Representations,

work page doi:10.1109/tpami.2025.3532688 2025
[8]

Can a suit of armor conduct electricity? a new dataset for open book question answering

URL https://openreview.net/forum?id=eFGQ97z5Cd. Li, Z., Liang, C., Zhang, Z., Hong, I., Kim, Y. J., Chen, W., and Zhao, T. Slimmoe: Structured compression of large moe models via expert slimming and distillation. In Second Conference on Language Modeling, 2025b. URL https://openreview.net/forum?id=oaCUsn391F. Lin, T.-Y., Maire, M., Belongie, S., Hays,...

work page doi:10.18653/v1/d18-1260 2014
[9]

gpt-oss-120b & gpt-oss-20b Model Card

URL https://openreview.net/forum?id=Pu3c0209cx. OpenAI, :, Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R. K., Bai, Y., Baker, B., Bao, H., Barak, B., Bennett, A., Bertao, T., Brett, N., Brevdo, E., Brockman, G., Bubeck, S., Chang, C., Chen, K., Chen, M., Cheung, E., Clark, A., Cook, D., Dukhan, M., Dvorak, C., Fives, K...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

In: 2015 IEEE International Conference on Computer Vision (ICCV)

doi: 10.1109/ICCV.2015.303. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, vol...

work page doi:10.1109/iccv.2015.303 2015
[11]

Eigenvectors of Experts are Training-free Non-collapsing Routers

URL https://openreview.net/forum?id=B1ckMDqlg. Shen, L., Chen, G., Shao, R., Guan, W., and Nie, L. Mome: Mixture of multimodal experts for generalist multimodal large language models. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 420...

work page doi:10.52202/079017-1330 2024
