Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
Recognition: 3 theorem links · Lean theorems
Pith reviewed 2026-05-08 17:52 UTC · model grok-4.3
The pith
AIR-MoE approximates top-k expert routing in granular mixtures using vector quantization shortlisting followed by fine scoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts, then computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and imposing no structural constraints on expert parameters. It serves as a drop-in replacement that requires no modifications to the model architecture or loss function, and the paper provides a lower bound on the mass recall of the shortlisting stage.
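To make the two-stage procedure concrete, here is a minimal NumPy sketch of the routing path for a single token. This is a reading of the abstract rather than the authors' implementation: the inner-product codeword assignment, the shape of the inverted index, and the softmax gating are all assumptions.

```python
import numpy as np

def air_moe_route(h, codebook, index, W_router, k):
    """Sketch of two-stage routing: VQ shortlisting, then exact fine scoring.

    h        : (d,) token hidden state
    codebook : (C, d) VQ codewords
    index    : dict codeword id -> array of candidate expert ids (the inverted
               index; its construction is paper-specific and assumed here)
    W_router : (E, d) full router weight matrix, one row per expert
    k        : number of experts to activate (shortlists are assumed >= k)
    """
    # Stage 1: coarse shortlisting. Assign the token to its nearest codeword
    # (inner product as a stand-in metric) and read off that codeword's
    # candidate expert set.
    code = int(np.argmax(codebook @ h))
    candidates = np.asarray(index[code])

    # Stage 2: fine scoring. Exact router logits, restricted to the shortlist;
    # experts outside it are never scored.
    logits = W_router[candidates] @ h
    top = np.argsort(logits)[-k:]
    experts = candidates[top]

    # Softmax gate over the selected experts, as in a standard top-k router.
    g = np.exp(logits[top] - logits[top].max())
    return experts, g / g.sum()
```

The saving is in stage 2: `W_router[candidates] @ h` touches only the shortlist rather than all E expert rows, which is where routing cost concentrates when experts are many and small.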
What carries the argument
The two-stage adaptive inverted-index routing mechanism, where vector quantization codewords enable coarse candidate set construction for subsequent fine scoring.
If this is right
- Granular MoE models can scale to more experts without routing computation becoming prohibitive.
- The method maintains compatibility with standard training procedures since no architectural or loss changes are needed.
- Performance gains are observed over existing routing approaches in high-granularity regimes.
- The mass recall lower bound offers a way to analyze and tune the shortlisting stage.
Where Pith is reading between the lines
- Similar inverted-index techniques might apply to other sparse selection problems beyond MoE.
- Increasing the number of VQ codewords could trade off shortlist size against recall quality in practice (see the sketch after this list).
- Testing on even larger models would reveal if the approximation remains effective as model scale grows.
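One way to probe the codeword trade-off flagged above is to sweep the codebook size and measure how much of the true top-k routing mass the shortlist captures. The sketch below uses random weights and a deliberately crude index construction, both stand-ins since the abstract does not describe the paper's codebook training or index; "mass recall" is taken here to mean the fraction of the true top-k routing mass that survives shortlisting, which may differ from the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, T, k = 64, 512, 2000, 8          # dims, experts, tokens, top-k (all assumed)
W = rng.normal(size=(E, d))            # router weights (random stand-in)
H = rng.normal(size=(T, d))            # token hidden states

def mean_mass_recall(C):
    """Average fraction of true top-k routing mass kept by a C-codeword shortlist."""
    codebook = rng.normal(size=(C, d))
    # Crude offline index: each codeword lists the experts whose router rows
    # score highest against it (a stand-in for the paper's construction).
    cand = np.argsort(codebook @ W.T, axis=1)[:, -4 * k:]
    recalls = []
    for h in H:
        scores = W @ h
        topk = np.argsort(scores)[-k:]
        shortlist = cand[np.argmax(codebook @ h)]
        p = np.exp(scores - scores.max())          # unnormalized routing mass
        kept = np.intersect1d(topk, shortlist)
        recalls.append(p[kept].sum() / p[topk].sum())
    return float(np.mean(recalls))

for C in (8, 32, 128):
    print(f"codewords={C:4d}  mass recall ~ {mean_mass_recall(C):.3f}")
```

With everything random the absolute numbers are meaningless; the point is the measurement loop, which transfers directly to a trained router and codebook.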
Load-bearing premise
The vector quantization codewords must generate candidate sets with sufficiently high mass recall so that the fine scoring stage can recover near-optimal experts.
What would settle it
Observing a case where the shortlist recall falls significantly below the derived lower bound or where the final model performance underperforms a true top-k router on identical expert setups.
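The second half of that test, underperformance against a true top-k router, requires training runs, but the selection-level half is cheap to operationalize. A toy probe, assuming only that both routers share the same scoring matrix, checks how often a shortlisted router picks the same experts as exhaustive top-k:

```python
import numpy as np

rng = np.random.default_rng(1)
d, E, k, S = 64, 512, 8, 64           # dims, experts, top-k, shortlist size (assumed)
W = rng.normal(size=(E, d))           # shared router weights
agree = []
for _ in range(500):
    h = rng.normal(size=d)
    scores = W @ h
    exact = set(np.argsort(scores)[-k:].tolist())
    shortlist = rng.choice(E, size=S, replace=False)  # stand-in for the VQ shortlist
    approx = set(shortlist[np.argsort(scores[shortlist])[-k:]].tolist())
    agree.append(len(exact & approx) / k)
# A random shortlist agrees roughly S/E of the time; a useful learned
# shortlist must push this toward 1.0.
print("mean top-k agreement:", round(float(np.mean(agree)), 3))
```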
Original abstract
Mixture-of-experts (MoE) models enable scalable transformer architectures by activating only a subset of experts per token. Recent evidence suggests that performance improves with increasingly granular experts, i.e., many small experts instead of a few large ones. However, this regime substantially increases routing cost, which can dominate computation. We introduce adaptive inverted-index routing for MoE (AIR-MoE), an inverted-index-inspired routing architecture based on vector quantization (VQ). In a first stage, AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts. In a second stage, fine scoring computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and, in contrast to prior work, imposing no structural constraints on expert parameters. AIR-MoE serves as a drop-in replacement for standard routers and requires no modifications to the model architecture or loss function. We further provide a lower bound on the mass recall achieved by AIR-MoE that yields insights into its inner workings. Empirically, we demonstrate that AIR-MoE achieves improved performance compared to existing routing approaches in granular MoE settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AIR-MoE, a two-stage routing architecture for granular Mixture-of-Experts models. The first stage uses vector quantization to assign tokens to codewords and construct a shortlist of candidate experts; the second stage performs exact fine-grained scoring only on this shortlist. The approach is presented as a drop-in replacement for standard routers that approximates true top-k routing without structural constraints on expert parameters or changes to the model/loss. A lower bound on the mass recall of the VQ shortlisting stage is derived, and empirical results are reported showing improved performance over existing routing methods in granular MoE regimes.
Significance. If the mass-recall lower bound is tight enough in practice and the reported gains are reproducible with proper controls, the work would be significant for scaling MoE transformers to finer expert granularity. The drop-in compatibility and avoidance of expert-parameter constraints distinguish it from prior constrained routers, potentially enabling broader adoption in large-scale training.
Major comments (2)
- [Abstract] The central claim that AIR-MoE approximates true top-k routing rests on the VQ shortlist achieving sufficiently high mass recall for the fine-scoring stage to recover near-optimal experts. The abstract states that a lower bound is derived but provides neither the bound itself, its dependence on codebook size or granularity, nor any tightness analysis or empirical mass-recall values; without these, the approximation quality asserted in the strongest claim cannot be evaluated.
- [Empirical Evaluation] The abstract reports performance improvements over existing routers but supplies no experimental details, baselines, error bars, number of runs, or ablations on codebook size. This absence makes it impossible to verify whether the gains are attributable to the two-stage procedure or to other factors, directly affecting the soundness of the performance claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below. Where the comments identify opportunities for greater clarity, we have revised the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The central claim that AIR-MoE approximates true top-k routing rests on the VQ shortlist achieving sufficiently high mass recall for the fine-scoring stage to recover near-optimal experts. The abstract states that a lower bound is derived but provides neither the bound itself, its dependence on codebook size or granularity, nor any tightness analysis or empirical mass-recall values; without these, the approximation quality asserted in the strongest claim cannot be evaluated.
  Authors: We agree that the abstract would benefit from an explicit statement of the mass-recall lower bound and its properties to better support the central approximation claim. The full manuscript derives this bound in Section 3, establishing its dependence on codebook size and expert granularity. In the revised version we have updated the abstract with a concise statement of the bound and its key dependencies. We have also added empirical mass-recall values together with a tightness discussion in the experimental section so that readers can directly assess approximation quality. Revision: yes.
- Referee: [Empirical Evaluation] The abstract reports performance improvements over existing routers but supplies no experimental details, baselines, error bars, number of runs, or ablations on codebook size. This absence makes it impossible to verify whether the gains are attributable to the two-stage procedure or to other factors, directly affecting the soundness of the performance claim.
  Authors: We acknowledge that the abstract's brevity omits experimental specifics and that the empirical section would be strengthened by additional controls. The manuscript already compares against standard top-k and other routing baselines and includes codebook-size ablations; to directly address verifiability, we have revised both the abstract and the empirical section to report results averaged over multiple independent runs with error bars, to list all baselines explicitly, and to present the codebook ablation results more prominently. These additions allow clearer attribution of gains to the two-stage procedure. Revision: yes.
Circularity Check
No circularity: AIR-MoE routing and mass-recall bound are independently derived
Full rationale
The paper defines AIR-MoE as a two-stage procedure (VQ-based shortlisting followed by exact fine scoring on the candidate set) and separately derives a lower bound on mass recall from the VQ assignment properties. Neither the performance claim nor the bound reduces to a fitted parameter, self-referential definition, or self-citation chain; the approximation to top-k is presented as a consequence of the (bounded) recall rather than being assumed in the derivation. No load-bearing step matches any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Invented entities (1)
- AIR-MoE two-stage router: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Cost (Jcost, Jcost_unit0, Jcost_pos_of_ne_one), "Jcost / cosh-cost framework". Tag: unclear. The relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: MassRecall(h) ≥ exp(−2ε(h)) ρ_M(c(h)), where ε(h) := ‖h − c(h)‖.
- IndisputableMonolith.Foundation.LogicAsFunctionalEquation (washburn_uniqueness_aczel), "J-cost forcing". Tag: unclear. The relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: vector quantization (Gray, 1984) ... cosine similarity ... adaptive spherical k-means ... EMA updates.
- IndisputableMonolith (whole corpus), n/a (domain disjoint). Tag: unclear. The relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: approximate top-k routing for granular MoE; no structural constraints on expert parameters.
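Written out, the quoted bound and its quantization error read as follows; interpreting ρ_M(c(h)) as the routing mass associated with the assigned codeword is our gloss on the fragment, not the paper's stated definition.

```latex
\[
  \operatorname{MassRecall}(h) \;\ge\; e^{-2\varepsilon(h)}\,\rho_M\!\bigl(c(h)\bigr),
  \qquad \varepsilon(h) := \lVert h - c(h) \rVert .
\]
```

For instance, a quantization error of ε(h) = 0.1 with ρ_M(c(h)) = 0.9 guarantees MassRecall(h) ≥ 0.9·e^(−0.2) ≈ 0.74, so the bound is informative only when tokens sit close to their codewords.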
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Firas Abuzaid, Geet Sethi, Peter Bailis, and Matei Zaharia. To Index or Not to Index: Optimizing Exact Maximum Inner Product Search. International Conference on Data Engineering, pp. 1250–1261, 2019.
- [2] Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling Techniques for Kernel Methods. Advances in Neural Information Processing Systems, 14, 2001.
- [3] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Empirical Methods in Natural Language Processing, pp. 4895–4901, 2023.
- [4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv preprint arXiv:1607.06450, 2016.
- [5] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv preprint arXiv:1308.3432, 2013.
- [6] Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the Pile. arXiv preprint arXiv:2201.07311, 2022.
- [7] Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Bo Li, Saksham Singhal, Prakhar Bajaj, Xia Song, and Furu Wei. On the Representation Collapse of Sparse Mixture of Experts. Advances in Neural Information Processing Systems, 2022.
- [8] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified Scaling Laws for Routed Language Models. International Conference on Machine Learning, pp. 4057–4086, 2022.
- [9] Common Crawl Foundation. Common Crawl Creative Commons Corpus (C5). Common Crawl, 2025. Available at https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons, accessed 2025-08-11.
- [10] Inderjit S. Dhillon and Dharmendra S. Modha. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning, 42(1):143–175, 2001.
- [11] Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Panigrahy, Nikhil Vyas, and Xin Wang. On the Benefits of Learning to Route in Mixture-of-Experts Models. Conference on Empirical Methods in Natural Language Processing, pp. 9376–9396, 2023.
- [12] Giang Do, Kha Pham, Hung Le, and Truyen Tran. On the Role of Discrete Representation in Sparse Mixture of Experts. arXiv preprint arXiv:2411.19402, 2024.
- [13] Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL Technical Report. arXiv preprint arXiv:2504.07491, 2025.
- [14] Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. International Conference on Machine Learning, pp. 5547–5569, 2022.
- [15] Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. Conference on Computer Vision and Pattern Recognition, pp. 12873–12883, 2021.
- [16] William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
- [17] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027, 2020.
- [18] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.
- [19] Robert Gray. Vector Quantization. IEEE ASSP Magazine, 1(2):4–29, 1984.
- [20] Xu Owen He. Mixture of a Million Experts. arXiv preprint arXiv:2407.04153, 2024.
- [21] Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. ACM Symposium on Theory of Computing, pp. 604–613, 1998.
- [22] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive Mixtures of Local Experts. Neural Computation, 3(1):79–87, 1991.
- [23] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of Experts. arXiv preprint arXiv:2401.04088, 2024.
- [24] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
- [25] Michael I. Jordan and Robert A. Jacobs. Hierarchies of Adaptive Experts. Advances in Neural Information Processing Systems, 4, 1991.
- [26] Michael I. Jordan and Robert A. Jacobs. Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6(2):181–214, 1994.
- [27] Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling Laws for Fine-Grained Mixture of Experts. International Conference on Machine Learning, 2024.
- [28] Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large Memory Layers with Product Keys. Advances in Neural Information Processing Systems, 32, 2019.
- [29] Alain Lelu and Martine Cadot. Evaluation of Text Clustering Methods and Their Dataspace Embeddings: An Exploration. International Federation of Classification Societies, pp. 131–139, 2019.
- [30] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. International Conference on Learning Representations, 2021.
- [31] Yoseph Linde, Andres Buzo, and Robert Gray. An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications, 28(1):84–95, 1980.
- [32] James B. McQueen. Some Methods of Classification and Analysis of Multivariate Observations. Proc. of 5th Berkeley Symposium on Math. Stat. and Prob., pp. 281–297, 1967.
- [33] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models. International Conference on Learning Representations, 2017.
- [34] Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, and Nhat Ho. Statistical Advantages of Perturbing Cosine Router in Sparse Mixture of Experts. arXiv preprint arXiv:2405.14131, 2024.
- [35] Cicero Nogueira dos Santos, James Lee-Thorp, Isaac Noble, Chung-Ching Chang, and David Uthus. Memory Augmented Language Models through Mixture of Word Experts. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4425–4438, 2024.
- [36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [37] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2. Advances in Neural Information Processing Systems, 32, 2019.
- [38] Stephen Roller, Sainbayar Sukhbaatar, and Jason Weston. Hash Layers For Large Sparse Models. Advances in Neural Information Processing Systems, 34:17555–17566, 2021.
- [39] Gerard Salton, Edward A. Fox, and Harry Wu. Extended Boolean Information Retrieval. Communications of the ACM, 26(11):1022–1036, 1983.
- [40] Noam Shazeer. GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202, 2020.
- [41] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. International Conference on Learning Representations, 2017.
- [42] Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). Advances in Neural Information Processing Systems, 27, 2014.
- [43] Josef Sivic and Andrew Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. International Conference on Computer Vision, pp. 1470–1477, 2003.
- [44] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568:127063, 2024.
- [45] Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, et al. MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts. arXiv preprint arXiv:2407.09816, 2024.
- [46] Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, et al. Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity. arXiv preprint arXiv:2505.21411, 2025.
- [47] Aaron van den Oord, Oriol Vinyals, et al. Neural Discrete Representation Learning. Advances in Neural Information Processing Systems, 30, 2017.
- [48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 2017.
- [49] Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, et al. A Neural Corpus Indexer for Document Retrieval. Advances in Neural Information Processing Systems, 35:25600–25614, 2022.
- [50] Will Williams, Sam Ringer, Tom Ash, David MacLeod, Jamie Dougherty, and John Hughes. Hierarchical Quantized Autoencoders. Advances in Neural Information Processing Systems, 33:4524–4535, 2020.
- [51] Yuan Xie, Shaohan Huang, Tianyu Chen, and Furu Wei. MoEC: Mixture of Expert Clusters. AAAI Conference on Artificial Intelligence, 37(11):13807–13815, 2023.
- [52] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Quoc V. Le, James Laudon, et al. Mixture-of-Experts with Expert Choice Routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022.
- [53] Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon, and Daxin Jiang. Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation. arXiv preprint arXiv:2206.10128, 2022.