Tight Clusters Make Specialized Experts
Pith reviewed 2026-05-23 02:51 UTC · model grok-4.3
The pith
Deriving feature weights from cluster tightness lets the MoE router match tokens to experts in a better-separated space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For each expert cluster, the Adaptive Clustering router computes a set of feature weights that scale each feature according to how tightly that expert's tokens cluster along it. These weights transform the space in which token-expert routing assignments are computed, promoting well-separated clusters and more reliable matching. The paper claims this yields faster convergence, better robustness to data corruption, and overall performance gains, because experts specialize in semantically distinct regions of the input space.
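As a minimal sketch of what routing in a per-expert reweighted space could look like, the snippet below sends a token to the expert with the smallest weighted squared distance to that expert's center. The distance-based scoring, the helper names `weighted_sq_distance` and `route`, and every numeric value are illustrative assumptions; the paper's router may score tokens differently.

```python
def weighted_sq_distance(token, center, weights):
    """Squared distance in which each feature is scaled by its weight."""
    return sum(w * (x - c) ** 2 for x, c, w in zip(token, center, weights))


def route(token, centers, weights):
    """Return the index of the expert with the smallest weighted distance.

    Each expert k has its own center centers[k] and weight vector weights[k].
    """
    costs = [weighted_sq_distance(token, c, w) for c, w in zip(centers, weights)]
    return min(range(len(costs)), key=costs.__getitem__)


# Two hypothetical experts in a 2-feature space.
centers = [[0.0, 0.0], [4.0, 5.0]]

# Baseline: every feature counts equally for every expert.
uniform = [[1.0, 1.0], [1.0, 1.0]]

# Assumed scenario: both experts are tight along feature 0 and loose along
# feature 1, so feature 1 is strongly down-weighted.
adaptive = [[1.0, 0.01], [1.0, 0.01]]

token = [1.0, 5.0]
baseline_choice = route(token, centers, uniform)
adaptive_choice = route(token, centers, adaptive)
```

In this toy setting the baseline picks expert 1 because the token happens to match it on the loose feature, while the reweighted distance picks expert 0, whose center is closer along the tight feature.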
What carries the argument
The Adaptive Clustering (AC) router, which derives per-expert feature weights intended to make the latent clusters maximally identifiable and then routes tokens in the resulting adaptively transformed space.
If this is right
- MoE models using the AC router converge faster than those with standard routers.
- These models show improved robustness when trained on corrupted data.
- Final performance increases on language modeling and image recognition tasks.
- Experts become specialized in distinct semantic regions rather than overlapping.
- The benefits hold across various MoE architectures.
Where Pith is reading between the lines
- The same weighting idea could be applied to other clustering-based assignment problems outside MoE.
- If the weights can be computed efficiently, the overhead may be low enough for large-scale training.
- Models might require fewer experts or smaller expert sizes if specialization is tighter.
- Extending the method to dynamic or online cluster identification would test its adaptability to changing data distributions.
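The points above about computational overhead and online identification can be made concrete with a streaming estimate of per-feature dispersion. The sketch below uses Welford's algorithm to keep a running mean and variance for one expert without storing its tokens. The class name `RunningDispersion` and the choice of population variance as the dispersion measure are assumptions for illustration, not details taken from the paper.

```python
class RunningDispersion:
    """Running per-feature mean and variance for one expert (Welford's algorithm)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations per feature

    def update(self, token):
        """Fold one token into the running statistics in O(dim) time."""
        self.n += 1
        for j, x in enumerate(token):
            delta = x - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (x - self.mean[j])

    def variance(self):
        """Population variance per feature (zeros before any token is seen)."""
        if self.n == 0:
            return [0.0] * len(self.mean)
        return [m / self.n for m in self.m2]


stats = RunningDispersion(dim=2)
for tok in [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]:
    stats.update(tok)
running_mean = stats.mean
running_var = stats.variance()
```

Each update costs time linear in the feature dimension, which is the kind of overhead that would need to stay small relative to the expert computation for the approach to scale.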
Load-bearing premise
The derived optimal feature weights produce a transformed space in which the latent clusters are sufficiently well-separated to allow reliable token-expert matching.
What would settle it
Training an MoE model with the AC router on a high-dimensional dataset with known but overlapping clusters and observing no gain in convergence rate or noise robustness compared to a baseline router would falsify the central claim.
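A synthetic dataset for such a test could be generated along the following lines. This is only a sketch of the data side: the function names, the Gaussian cluster model, and all default values are hypothetical, and training the MoE itself is out of scope here.

```python
import random


def make_overlapping_clusters(n_per_cluster=50, n_clusters=2, informative_dims=2,
                              noise_dims=30, separation=1.0, noise_scale=3.0,
                              seed=0):
    """Generate tokens whose cluster identity lives in a few informative features.

    Cluster k is centered at k * separation along each informative feature with
    unit standard deviation, so small separations give overlapping clusters.
    The remaining features are pure noise shared by all clusters.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for k in range(n_clusters):
        for _ in range(n_per_cluster):
            informative = [rng.gauss(k * separation, 1.0)
                           for _ in range(informative_dims)]
            noise = [rng.gauss(0.0, noise_scale) for _ in range(noise_dims)]
            tokens.append(informative + noise)
            labels.append(k)
    return tokens, labels


def routing_accuracy(assignments, labels):
    """Fraction of tokens routed to the expert whose index equals the true label.

    Assumes expert indices have already been aligned with cluster labels.
    """
    return sum(a == b for a, b in zip(assignments, labels)) / len(labels)


tokens, labels = make_overlapping_clusters()
```

Because the ground-truth labels are known, the routing decisions of any router can be scored directly, and the separation and noise scale can be varied to probe where matching breaks down.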
Original abstract
Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an Adaptive Clustering (AC) router for Sparse Mixture-of-Experts (MoE) models. It derives, for each expert, a set of feature weights that scale dimensions according to how tightly the expert's assigned tokens cluster along each feature. These weights are used to route tokens in an adaptively transformed space intended to make latent clusters more separable, thereby improving token-expert matching. The authors claim this yields three linked benefits—faster convergence, greater robustness to data corruption, and higher overall performance—and support the claims with experiments on language modeling and image recognition tasks using multiple MoE backbones in both clean and corrupted data regimes.
Significance. If the weight derivation is shown to increase cluster separation with a verifiable guarantee and the reported gains hold under controlled ablations, the contribution would be meaningful for MoE training stability. The work directly targets the high-dimensional identifiability problem that standard routers face and provides empirical results across clean and corrupted settings, which is a practical strength. No machine-checked proofs or parameter-free derivations are present, but the focus on an explicit clustering lens for routing is a clear conceptual step.
major comments (2)
- [§3] §3 (Method), derivation of feature weights: the optimality claim is stated without an explicit objective function, stationarity condition, or separation bound (e.g., no intra-/inter-cluster ratio, margin, or convergence guarantee for the transformed metric). Because the three claimed benefits rest on the transformed space producing reliably better token-expert matching than the original space, the absence of such a guarantee is load-bearing.
- [§4] §4 (Experiments), robustness and convergence claims: the reported gains on corrupted data and faster convergence are presented without an ablation that isolates the effect of the adaptive transformation from other implementation choices (e.g., initialization, regularization, or routing temperature). This makes it difficult to attribute the benefits mechanistically to the cluster-tightness weighting.
minor comments (2)
- [§3] Notation for the per-expert weight vector is introduced without a clear symbol table or consistent use across equations and algorithm boxes.
- [Figure 3] Figure captions for the routing visualizations do not state the exact corruption type, noise level, or number of tokens shown, reducing reproducibility of the qualitative results.
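The intra-/inter-cluster ratio mentioned in the first major comment can be computed directly once a weighting is fixed. The sketch below is one possible form of such a diagnostic, assuming a single shared weight vector for simplicity (the paper uses per-expert weights) and a hypothetical helper name `separation_ratio`; lower values mean tighter clusters relative to the spacing of their centroids.

```python
def centroid(points):
    """Per-feature mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def weighted_sq_dist(a, b, weights):
    """Squared distance with each feature scaled by its weight."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))


def separation_ratio(clusters, weights):
    """Mean weighted intra-cluster scatter over mean weighted centroid spacing."""
    centers = [centroid(c) for c in clusters]
    total = sum(len(c) for c in clusters)
    intra = sum(weighted_sq_dist(p, centers[k], weights)
                for k, c in enumerate(clusters) for p in c) / total
    pairs = [(i, j) for i in range(len(centers))
             for j in range(i + 1, len(centers))]
    inter = sum(weighted_sq_dist(centers[i], centers[j], weights)
                for i, j in pairs) / len(pairs)
    return intra / inter


# Two toy clusters: tight along feature 0, spread out along feature 1.
clusters = [[[0.0, -3.0], [0.0, 3.0]], [[2.0, -3.0], [2.0, 3.0]]]
ratio_uniform = separation_ratio(clusters, [1.0, 1.0])
ratio_weighted = separation_ratio(clusters, [1.0, 0.1])
```

Reporting a quantity of this kind before and after the transformation would be one way to back the separation claim with a measurable statistic.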
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications on the method and experiments while committing to revisions where the concerns are valid.
- Referee: [§3] §3 (Method), derivation of feature weights: the optimality claim is stated without an explicit objective function, stationarity condition, or separation bound (e.g., no intra-/inter-cluster ratio, margin, or convergence guarantee for the transformed metric). Because the three claimed benefits rest on the transformed space producing reliably better token-expert matching than the original space, the absence of such a guarantee is load-bearing.
  Authors: We agree that the derivation would benefit from greater formality. The feature weights are computed per expert as the inverse of the per-dimension variance of tokens assigned to that expert, which has the effect of up-weighting dimensions along which the expert's tokens form a tight cluster. This choice is motivated by the objective of increasing the relative contribution of low-variance (tight) dimensions in the routing distance computation. However, the manuscript does not supply an explicit optimization objective, stationarity condition, or separation bound, nor does it prove that the transformed metric yields strictly better matching than the original space. We will revise §3 to state the objective explicitly as minimizing a weighted intra-cluster variance and to clarify that the three claimed benefits are supported empirically rather than by a theoretical guarantee.
  Revision: yes
- Referee: [§4] §4 (Experiments), robustness and convergence claims: the reported gains on corrupted data and faster convergence are presented without an ablation that isolates the effect of the adaptive transformation from other implementation choices (e.g., initialization, regularization, or routing temperature). This makes it difficult to attribute the benefits mechanistically to the cluster-tightness weighting.
  Authors: The referee is correct that the current experimental design compares complete routers rather than isolating the adaptive weighting step. While the AC router is evaluated against standard Top-k and noisy Top-k baselines on multiple backbones and both clean and corrupted regimes, we do not present an ablation that applies only the learned feature weights while freezing all other routing hyperparameters. In the revision we will add a controlled ablation that replaces the learned weights with uniform scaling (or with weights derived from a non-adaptive statistic) while keeping initialization, temperature, and regularization identical, thereby quantifying the isolated contribution of the cluster-tightness weighting to convergence speed and robustness.
  Revision: yes
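The two ingredients named in the rebuttal, inverse per-dimension variance weights and a uniform-scaling ablation, can be sketched as follows. The function names and the stabilizing constant `eps` are assumptions added for illustration; the rebuttal does not specify how zero variance is handled or whether the weights are normalized.

```python
def per_feature_variance(points):
    """Population variance of each feature over the tokens assigned to one expert."""
    n = len(points)
    dim = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    return [sum((p[j] - means[j]) ** 2 for p in points) / n for j in range(dim)]


def inverse_variance_weights(points, eps=1e-6):
    """Weights that grow as a feature's variance shrinks (eps is a hypothetical guard)."""
    return [1.0 / (v + eps) for v in per_feature_variance(points)]


def uniform_weights(dim):
    """Ablation baseline: every feature gets the same weight."""
    return [1.0] * dim


# Hypothetical tokens for one expert: tight along feature 0, loose along feature 1.
expert_tokens = [[0.9, -10.0], [1.0, 0.0], [1.1, 10.0]]
adaptive = inverse_variance_weights(expert_tokens)
ablation = uniform_weights(len(expert_tokens[0]))
```

Swapping `adaptive` for `ablation` in an otherwise identical routing function is the kind of controlled comparison the authors commit to, since it changes only the weighting step.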
Circularity Check
Optimal feature weights defined from cluster tightness on the same assignments they enable
specific steps
- self-definitional [Abstract]: "We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. ... for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature."
  The weights are defined to scale features by the tightness of the expert's clusters along each feature; those same weights are then used to produce the routing assignments that define the clusters. The optimality criterion is therefore satisfied by construction once the weights are applied to the clustering they were derived from, with no shown external guarantee that the transformation increases separability beyond the input clustering.
full rationale
The paper's core mechanism derives feature weights 'that maximally identify the latent clusters' and 'scales features according to whether that expert clusters tightly along that feature,' then uses those weights for routing in the transformed space. This step is load-bearing for the claimed convergence/robustness gains. No independent objective function, separation bound, or external validation is shown in the provided text; the optimality criterion is stated directly in terms of the cluster tightness that the router is simultaneously trying to produce. This matches the self-definitional pattern. The derivation chain therefore reduces the 'prediction' of better-separated clusters to quantities computed from the clustering itself.
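A toy example can illustrate the shape of this concern. In the sketch below, weights are fit from a given assignment and tokens are then reassigned under those weights; two different partitions of the same four points each reproduce themselves, and each yields weights that favor the feature it was split on. The helper names, the inverse-variance rule with constant `eps`, and the data are all assumptions for illustration, and the example says nothing about how the actual router behaves during training.

```python
def fit_expert(points, eps=1e-6):
    """Center and inverse-variance weights computed from one expert's tokens."""
    n = len(points)
    dim = len(points[0])
    center = [sum(p[j] for p in points) / n for j in range(dim)]
    var = [sum((p[j] - center[j]) ** 2 for p in points) / n for j in range(dim)]
    return center, [1.0 / (v + eps) for v in var]


def reassign(tokens, assignments, n_experts):
    """One round: fit each expert on its current tokens, then re-route every token.

    Assumes every expert currently holds at least one token.
    """
    experts = [fit_expert([t for t, a in zip(tokens, assignments) if a == k])
               for k in range(n_experts)]

    def cost(token, expert):
        center, weights = expert
        return sum(w * (x - c) ** 2 for x, c, w in zip(token, center, weights))

    return [min(range(n_experts), key=lambda k: cost(t, experts[k]))
            for t in tokens]


tokens = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
split_on_f0 = [0, 0, 1, 1]  # partition by the sign of feature 0
split_on_f1 = [0, 1, 0, 1]  # partition by the sign of feature 1

after_f0 = reassign(tokens, split_on_f0, 2)
after_f1 = reassign(tokens, split_on_f1, 2)
_, weights_f0 = fit_expert([tokens[0], tokens[1]])  # expert 0 under split_on_f0
_, weights_f1 = fit_expert([tokens[0], tokens[2]])  # expert 0 under split_on_f1
```

Both partitions are self-consistent, so tightness measured on the current assignment cannot by itself tell which partition reflects the true latent structure; an external criterion would be needed for that.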
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Latent clusters exist in the input distribution and can be maximally identified by per-expert feature weights that scale features according to cluster tightness.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Theorem 1 (Optimal feature weights). ... w_qk = (λ/d) / (s_qk + α_k) ... inversely proportional to the measure of dispersion
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Definition 1 (Adaptive Clustering Router Transformation M_k) ... diag(1/s_1k, …, 1/s_dk)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
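The two paper passages quoted in the Lean-theorem list (Theorem 1 and Definition 1) can be evaluated numerically as written. In the sketch below, `s` stands for the per-feature dispersions s_qk of one expert k, and `lam` and `alpha_k` stand for λ and α_k. Reading d as the number of features is an assumption, the quoted passages elide the surrounding definitions, and all input values are hypothetical.

```python
def optimal_weights(s, lam, alpha_k):
    """w_qk = (lam / d) / (s_qk + alpha_k) for each feature q of one expert k.

    d is taken to be the number of features (an assumption).
    """
    d = len(s)
    return [(lam / d) / (s_q + alpha_k) for s_q in s]


def transformation_diagonal(s):
    """Diagonal entries of M_k = diag(1/s_1k, ..., 1/s_dk)."""
    return [1.0 / s_q for s_q in s]


# Hypothetical dispersions for one expert over two features.
weights = optimal_weights([0.5, 1.5], lam=2.0, alpha_k=0.5)
diagonal = transformation_diagonal([0.5, 2.0])
```

In both expressions a larger dispersion gives a smaller entry, which matches the quoted description of weights that are inversely proportional to the measure of dispersion.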