pith. sign in

arxiv: 2502.15315 · v3 · submitted 2025-02-21 · 💻 cs.LG

Tight Clusters Make Specialized Experts

Pith reviewed 2026-05-23 02:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture of expertsadaptive clusteringrouter designfeature weightingtoken routingsparse modelscluster separation
0
0 comments X

The pith

Deriving feature weights from cluster tightness lets the MoE router match tokens to experts in a better-separated space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard routers in sparse Mixture-of-Experts models fail to identify latent clusters in high-dimensional inputs, slowing convergence and hurting robustness. By deriving optimal weights that scale each feature by how tightly an expert's tokens cluster along it, the new Adaptive Clustering router transforms the input space so clusters separate more clearly. Routing then occurs in this space, producing faster convergence, greater resistance to data corruption, and higher final accuracy on language and image tasks. A sympathetic reader would care because better routing means experts truly specialize rather than overlap, which is the core promise of MoE architectures. The method is tested across multiple backbones in both clean and corrupted settings.

Core claim

The Adaptive Clustering router computes a set of weights for each expert cluster that scales features according to whether that expert clusters tightly along that feature. These weights transform the space in which token-expert routing assignments are computed, promoting well-separated clusters and reliable matching. This yields faster convergence, better robustness to data corruption, and overall performance gains because experts specialize in semantically distinct regions of the input space.

What carries the argument

The Adaptive Clustering (AC) router, which uses per-expert feature weights derived to maximize cluster identification for routing in an adaptively transformed space.

If this is right

  • MoE models using the AC router converge faster than those with standard routers.
  • These models show improved robustness when trained on corrupted data.
  • Final performance increases on language modeling and image recognition tasks.
  • Experts become specialized in distinct semantic regions rather than overlapping.
  • The benefits hold across various MoE architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same weighting idea could be applied to other clustering-based assignment problems outside MoE.
  • If the weights can be computed efficiently, the overhead may be low enough for large-scale training.
  • Models might require fewer experts or smaller expert sizes if specialization is tighter.
  • Extending the method to dynamic or online cluster identification would test its adaptability to changing data distributions.

Load-bearing premise

The derived optimal feature weights produce a transformed space in which the latent clusters are sufficiently well-separated to allow reliable token-expert matching.

What would settle it

Training an MoE model with the AC router on a high-dimensional dataset with known but overlapping clusters and observing no gain in convergence rate or noise robustness compared to a baseline router would falsify the central claim.

Figures

Figures reproduced from arXiv: 2502.15315 by Laziz U. Abdullaev, Rachel S.Y. Teo, Stefan K. Nielsen, Tan M. Nguyen.

Figure 1
Figure 1. Figure 1: ACMoE discovers semantically distinct regions. We show 14x14 image reconstructions where [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Fast Convergence of ACMoE. Left: Convergence speed on WikiText-103 pretraining using the Generalist Language Model (Du et al., 2022) backbone. Right: Convergence speed on Banking-77 finetuning using the Switch Transformer (Fedus et al., 2022) backbone. Across both backbones and tasks, we observe substantially faster convergence. We display final test perplexity (PPL) and accuracy (Acc.), showing better ove… view at source ↗
Figure 3
Figure 3. Figure 3: ACMoE and Swin Transformer under PGD attack at increasing perturbation budgets. ACMoE widens [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cluster Visualization on ImageNet. Each token is represented as a point and colored by its assigned [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Router Instability of ACMoE, SMoE, XMoE, and StableMoE. ACMoE maintains consis￾tent routing, while baseline routers more frequently change the expert assignments of tokens. In [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
read the original abstract

Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an Adaptive Clustering (AC) router for Sparse Mixture-of-Experts (MoE) models. It derives, for each expert, a set of feature weights that scale dimensions according to how tightly the expert's assigned tokens cluster along each feature. These weights are used to route tokens in an adaptively transformed space intended to make latent clusters more separable, thereby improving token-expert matching. The authors claim this yields three linked benefits—faster convergence, greater robustness to data corruption, and higher overall performance—and support the claims with experiments on language modeling and image recognition tasks using multiple MoE backbones in both clean and corrupted data regimes.

Significance. If the weight derivation is shown to increase cluster separation with a verifiable guarantee and the reported gains hold under controlled ablations, the contribution would be meaningful for MoE training stability. The work directly targets the high-dimensional identifiability problem that standard routers face and provides empirical results across clean and corrupted settings, which is a practical strength. No machine-checked proofs or parameter-free derivations are present, but the focus on an explicit clustering lens for routing is a clear conceptual step.

major comments (2)
  1. [§3] §3 (Method), derivation of feature weights: the optimality claim is stated without an explicit objective function, stationarity condition, or separation bound (e.g., no intra-/inter-cluster ratio, margin, or convergence guarantee for the transformed metric). Because the three claimed benefits rest on the transformed space producing reliably better token-expert matching than the original space, the absence of such a guarantee is load-bearing.
  2. [§4] §4 (Experiments), robustness and convergence claims: the reported gains on corrupted data and faster convergence are presented without an ablation that isolates the effect of the adaptive transformation from other implementation choices (e.g., initialization, regularization, or routing temperature). This makes it difficult to attribute the benefits mechanistically to the cluster-tightness weighting.
minor comments (2)
  1. [§3] Notation for the per-expert weight vector is introduced without a clear symbol table or consistent use across equations and algorithm boxes.
  2. [Figure 3] Figure captions for the routing visualizations do not state the exact corruption type, noise level, or number of tokens shown, reducing reproducibility of the qualitative results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications on the method and experiments while committing to revisions where the concerns are valid.

read point-by-point responses
  1. Referee: [§3] §3 (Method), derivation of feature weights: the optimality claim is stated without an explicit objective function, stationarity condition, or separation bound (e.g., no intra-/inter-cluster ratio, margin, or convergence guarantee for the transformed metric). Because the three claimed benefits rest on the transformed space producing reliably better token-expert matching than the original space, the absence of such a guarantee is load-bearing.

    Authors: We agree that the derivation would benefit from greater formality. The feature weights are computed per expert as the inverse of the per-dimension variance of tokens assigned to that expert, which has the effect of up-weighting dimensions along which the expert's tokens form a tight cluster. This choice is motivated by the objective of increasing the relative contribution of low-variance (tight) dimensions in the routing distance computation. However, the manuscript does not supply an explicit optimization objective, stationarity condition, or separation bound, nor does it prove that the transformed metric yields strictly better matching than the original space. We will revise §3 to state the objective explicitly as minimizing a weighted intra-cluster variance and to clarify that the three claimed benefits are supported empirically rather than by a theoretical guarantee. revision: yes

  2. Referee: [§4] §4 (Experiments), robustness and convergence claims: the reported gains on corrupted data and faster convergence are presented without an ablation that isolates the effect of the adaptive transformation from other implementation choices (e.g., initialization, regularization, or routing temperature). This makes it difficult to attribute the benefits mechanistically to the cluster-tightness weighting.

    Authors: The referee is correct that the current experimental design compares complete routers rather than isolating the adaptive weighting step. While the AC router is evaluated against standard Top-k and noisy Top-k baselines on multiple backbones and both clean and corrupted regimes, we do not present an ablation that applies only the learned feature weights while freezing all other routing hyperparameters. In the revision we will add a controlled ablation that replaces the learned weights with uniform scaling (or with weights derived from a non-adaptive statistic) while keeping initialization, temperature, and regularization identical, thereby quantifying the isolated contribution of the cluster-tightness weighting to convergence speed and robustness. revision: yes

Circularity Check

1 steps flagged

Optimal feature weights defined from cluster tightness on the same assignments they enable

specific steps
  1. self definitional [Abstract]
    "We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. ... for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature."

    The weights are defined to scale features by the tightness of the expert's clusters along each feature; those same weights are then used to produce the routing assignments that define the clusters. The optimality criterion is therefore satisfied by construction once the weights are applied to the clustering they were derived from, with no shown external guarantee that the transformation increases separability beyond the input clustering.

full rationale

The paper's core mechanism derives feature weights 'that maximally identify the latent clusters' and 'scales features according to whether that expert clusters tightly along that feature,' then uses those weights for routing in the transformed space. This step is load-bearing for the claimed convergence/robustness gains. No independent objective function, separation bound, or external validation is shown in the provided text; the optimality criterion is stated directly in terms of the cluster tightness that the router is simultaneously trying to produce. This matches the self-definitional pattern. The derivation chain therefore reduces the 'prediction' of better-separated clusters to quantities computed from the clustering itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claim rests on the existence of identifiable latent clusters and the validity of the weight derivation.

axioms (1)
  • domain assumption Latent clusters exist in the input distribution and can be maximally identified by per-expert feature weights that scale features according to cluster tightness.
    The abstract states that the router learns the underlying clustering structure and that the new weights promote well-separated clusters; this premise is required for the routing improvement to hold.

pith-pipeline@v0.9.0 · 5790 in / 1329 out tokens · 44129 ms · 2026-05-23T02:51:43.612059+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 6 internal anchors

  1. [1]

    Transformer meets twicing: Harnessing unattended residual information

    Laziz Abdullaev and Tan Minh Nguyen. Transformer meets twicing: Harnessing unattended residual information. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=16kG5aNleS

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy Alexey. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv: 2010.11929, 2020

  3. [3]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021

  4. [4]

    Conditional Computation in Neural Networks for faster models

    Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015

  5. [5]

    A variable-selection heuristic for k-means clustering

    Michael J Brusco and J Dennis Cradit. A variable-selection heuristic for k-means clustering. Psychometrika, 66: 0 249--270, 2001

  6. [6]

    Efficient intent detection with dual sentence encoders

    I \ n igo Casanueva, Tadas Tem c inas, Daniela Gerz, Matthew Henderson, and Ivan Vuli \'c . Efficient intent detection with dual sentence encoders. arXiv preprint arXiv:2003.04807, 2020

  7. [7]

    Towards understanding the mixture-of-experts layer in deep learning

    Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. Advances in neural information processing systems, 35: 0 23049--23062, 2022

  8. [8]

    On the representation collapse of sparse mixture of experts

    Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems, 35: 0 34600--34613, 2022

  9. [9]

    S table M o E : Stable routing strategy for mixture of experts

    Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. S table M o E : Stable routing strategy for mixture of experts. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 7085--7095, Dublin, Irel...

  10. [10]

    Maximum likelihood from incomplete data via the em algorithm

    Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 39 0 (1): 0 1--22, 1977

  11. [11]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255. Ieee, 2009

  12. [12]

    On the benefits of learning to route in mixture-of-experts models

    Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Panigrahy, Nikhil Vyas, and Xin Wang. On the benefits of learning to route in mixture-of-experts models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 9376--9396, 2023

  13. [13]

    Glam: Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp.\ 5547--5569. PMLR, 2022

  14. [14]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23 0 (120): 0 1--39, 2022

  15. [15]

    Clustering objects on subsets of attributes (with discussion)

    Jerome H Friedman and Jacqueline J Meulman. Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society Series B: Statistical Methodology, 66 0 (4): 0 815--849, 2004

  16. [16]

    Weighting and selection of variables for cluster analysis

    Ram Gnanadesikan, Jon R Kettenring, and Shiao Li Tsao. Weighting and selection of variables for cluster analysis. Journal of classification, 12: 0 113--136, 1995

  17. [17]

    Explaining and Harnessing Adversarial Examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014

  18. [18]

    Dynamic mixture of experts: An auto-tuning approach for efficient transformer models

    Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Zhaopeng Tu, and Tao Lin. Dynamic mixture of experts: An auto-tuning approach for efficient transformer models. arXiv preprint arXiv:2405.14297, 2024

  19. [19]

    Designing robust transformers using robust kernel density estimation

    Xing Han, Tongzheng Ren, Tan Nguyen, Khai Nguyen, Joydeep Ghosh, and Nhat Ho. Designing robust transformers using robust kernel density estimation. Advances in Neural Information Processing Systems, 36, 2024

  20. [20]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 8340--8349, 2021 a

  21. [21]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 15262--15271, 2021 b

  22. [22]

    Adaptive mixtures of local experts

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3 0 (1): 0 79--87, 1991

  23. [23]

    Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition

    Kenichi Kumatani, Robert Gmyr, Felipe Cruz Salinas, Linquan Liu, Wei Zuo, Devang Patel, Eric Sun, and Yu Shi. Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition. arXiv preprint arXiv:2112.05820, 2021

  24. [24]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    D Lepikhin, H Lee, Y Xu, D Chen, O Firat, Y Huang, M Krikun, N Shazeer, and Z Gshard. Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  25. [25]

    Base layers: Simplifying training of large, sparse models

    Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, pp.\ 6265--6274. PMLR, 2021

  26. [26]

    Sparsity-constrained optimal transport

    Tianlin Liu, Joan Puigcerver, and Mathieu Blondel. Sparsity-constrained optimal transport. arXiv preprint arXiv:2209.15466, 2022

  27. [27]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 10012--10022, 2021

  28. [28]

    Asymptotic convergence rate of the em algorithm for gaussian mixtures

    Jinwen Ma, Lei Xu, and Michael I Jordan. Asymptotic convergence rate of the em algorithm for gaussian mixtures. Neural Computation, 12 0 (12): 0 2881--2907, 2000

  29. [29]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. stat, 1050 0 (9), 2017

  30. [30]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016

  31. [31]

    Pidformer: Transformer meets control theory

    Tam Minh Nguyen, C \'e sar A Uribe, Tan Minh Nguyen, and Richard Baraniuk. Pidformer: Transformer meets control theory. In Forty-first International Conference on Machine Learning, 2024

  32. [32]

    Bertozzi, Richard Baraniuk, and Stanley Osher

    Tan Minh Nguyen, Tam Minh Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard Baraniuk, and Stanley Osher. A primal-dual framework for transformers and neural networks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=U_T8-5hClV

  33. [33]

    CAME x: Curvature-aware merging of experts

    Viet Dung Nguyen, Minh Nguyen Hoang, Luc Nguyen, Rachel Teo, Tan Minh Nguyen, and Linh Duy Tran. CAME x: Curvature-aware merging of experts. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=nT2u0M0nf8

  34. [34]

    Elliptical attention

    Stefan Nielsen, Laziz Abdullaev, Rachel SY Teo, and Tan Nguyen. Elliptical attention. Advances in Neural Information Processing Systems, 37: 0 109748--109789, 2025

  35. [35]

    Competesmoe--effective training of sparse mixture of experts via competition

    Quang Pham, Giang Do, Huy Nguyen, TrungTin Nguyen, Chenghao Liu, Mina Sartipi, Binh T Nguyen, Savitha Ramasamy, Xiaoli Li, Steven Hoi, et al. Competesmoe--effective training of sparse mixture of experts via competition. arXiv preprint arXiv:2402.02526, 2024

  36. [36]

    On the adversarial robustness of mixture of experts

    Joan Puigcerver, Rodolphe Jenatton, Carlos Riquelme, Pranjal Awasthi, and Srinadh Bhojanapalli. On the adversarial robustness of mixture of experts. Advances in Neural Information Processing Systems, 35: 0 9660--9671, 2022

  37. [37]

    From sparse to soft mixtures of experts

    Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951, 2023

  38. [38]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  39. [39]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020

  40. [40]

    Scaling vision with sparse mixture of experts

    Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, Andr \'e Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34: 0 8583--8595, 2021

  41. [41]

    Hash layers for large sparse models

    Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34: 0 17555--17566, 2021

  42. [42]

    The sparsely-gated mixture-of-experts layer

    N Shazeer, A Mirhoseini, K Maziarz, A Davis, Q Le, G Hinton, and J Dean. The sparsely-gated mixture-of-experts layer. Outrageously large neural networks, 2017

  43. [43]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp.\ 1631--1642, 2013

  44. [44]

    Momentum SM oe: Integrating momentum into sparse mixture of experts

    Rachel Teo and Tan Minh Nguyen. Momentum SM oe: Integrating momentum into sparse mixture of experts. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=y929esCZNJ

  45. [45]

    Mo LE x: Mixture of layer experts for fine-tuning with sparse upcycling

    Rachel Teo and Tan Minh Nguyen. Mo LE x: Mixture of layer experts for fine-tuning with sparse upcycling. In The Thirteenth International Conference on Learning Representations, 2025 a . URL https://openreview.net/forum?id=rWui9vLhOc

  46. [46]

    Unveiling the hidden structure of self-attention via kernel principal component analysis

    Rachel SY Teo and Tan Nguyen. Unveiling the hidden structure of self-attention via kernel principal component analysis. Advances in Neural Information Processing Systems, 37: 0 101393--101427, 2025 b

  47. [47]

    Adversarial risk and the dangers of evaluating against weak attacks

    Jonathan Uesato, Brendan O’donoghue, Pushmeet Kohli, and Aaron Oord. Adversarial risk and the dangers of evaluating against weak attacks. In International conference on machine learning, pp.\ 5025--5034. PMLR, 2018

  48. [48]

    Clustering n objects into k groups under optimal scaling of variables

    Stef Van Buuren and Willem J Heiser. Clustering n objects into k groups under optimal scaling of variables. Psychometrika, 54: 0 699--706, 1989

  49. [49]

    Attention is all you need

    A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017

  50. [50]

    A framework for feature selection in clustering

    Daniela M Witten and Robert Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105 0 (490): 0 713--726, 2010

  51. [51]

    Robust mixture-of-expert training for convolutional neural networks

    Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang, and Sijia Liu. Robust mixture-of-expert training for convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 90--101, 2023

  52. [52]

    Understanding the robustness in vision transformers

    Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International Conference on Machine Learning, pp.\ 27378--27394. PMLR, 2022 a

  53. [53]

    Mixture-of-experts with expert choice routing

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35: 0 7103--7114, 2022 b

  54. [54]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  55. [55]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  56. [56]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  57. [57]

    Wqqqxh; NF 骫?x]=?7 h n( NFH2:##ʯ ##GF myD ˈt2 dćzlJȈt2dD Q R ɈtI͈l ߠu,Y O>d<Sscձzꨯ Ā bȐ! <vumb뭷I֭?fΜ ?7ߌ򨮮 Űab vSƗnrR'< ܡ ꪫ O;v ƨQrTQfǕW^ O : / ? i甗 xGb UvFfϞ)שvӯ_MS]]ke

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...