UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
Pith reviewed 2026-05-08 11:51 UTC · model grok-4.3
The pith
A single shared expert pool improves MoE loss and allows expert parameters to grow sublinearly with depth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniPool replaces per-layer expert ownership with a single shared expert pool accessed by independent per-layer routers. A pool-level auxiliary loss maintains balanced utilization across the entire pool, and NormRouter supplies scale-stable sparse routing. Across five LLaMA-architecture scales trained on 30B tokens from the Pile, this yields up to 0.0386 lower validation loss than matched vanilla MoE. Reduced-pool versions using only 41.6%-66.7% of the vanilla expert-parameter budget still match or outperform layer-wise MoE, showing that expert parameters can grow sublinearly with depth while remaining more efficient than vanilla per-layer allocation.
What carries the argument
A globally shared expert pool accessed by per-layer routers, stabilized by a pool-level auxiliary loss and NormRouter.
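The mechanism is compact enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the shared-pool idea under standard top-k token routing; the class names, shapes, and FFN design are illustrative, and the paper's NormRouter and pool-level auxiliary loss are sketched separately further down.

```python
# Hedged sketch of a globally shared expert pool: one pool for the whole
# model, one independent router per layer. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedExpertPool(nn.Module):
    """One pool of FFN experts, instantiated once for the whole model."""

    def __init__(self, pool_size: int, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                           nn.Linear(d_ff, d_model))
             for _ in range(pool_size)]
        )

    def forward(self, x, expert_idx, gates):
        # x: (tokens, d_model); expert_idx, gates: (tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                      # (tokens, k)
            if mask.any():
                tok, slot = mask.nonzero(as_tuple=True)
                out[tok] += gates[tok, slot, None] * expert(x[tok])
        return out


class UniPoolLayer(nn.Module):
    """A layer owns only its router; expert weights live in the shared pool."""

    def __init__(self, pool: SharedExpertPool, d_model: int,
                 pool_size: int, k: int = 2):
        super().__init__()
        self.pool, self.k = pool, k
        self.router = nn.Linear(d_model, pool_size, bias=False)

    def forward(self, x):
        probs = F.softmax(self.router(x), dim=-1)       # (tokens, pool_size)
        gates, idx = probs.topk(self.k, dim=-1)
        gates = gates / gates.sum(-1, keepdim=True)     # renormalize top-k
        return x + self.pool(x, idx, gates)             # residual MoE sublayer
```

A vanilla MoE would instantiate one such pool per layer; here every UniPoolLayer receives the same pool object, so expert memory is paid once while routing remains depth-specific.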
If this is right
- Validation loss and perplexity improve over vanilla MoE at every tested scale from 182M to 978M parameters.
- Expert parameters can scale sublinearly with depth while still matching or beating layer-wise allocation (see the budget arithmetic after this list).
- Pool size becomes an explicit, tunable hyperparameter for depth scaling rather than being fixed by layer count.
- UniPool gains compose with finer-grained expert decomposition techniques.
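To see what the 41.6%-66.7% budget figures mean mechanically, here is back-of-the-envelope arithmetic; the layer count, per-layer expert count, and expert dimensions below are invented for illustration and are not the paper's configurations.

```python
# Hypothetical budget comparison: vanilla MoE owns layers * experts_per_layer
# experts, while UniPool owns a single pool whose size is a free hyperparameter.
layers, experts_per_layer = 12, 8
d_model, d_ff = 512, 2048
params_per_expert = 2 * d_model * d_ff                    # two projection matrices

vanilla = layers * experts_per_layer * params_per_expert  # linear in depth
for frac in (0.416, 0.667, 1.0):
    pool_size = round(frac * layers * experts_per_layer)
    share = pool_size * params_per_expert / vanilla
    print(f"pool={pool_size:3d}: {share:.1%} of vanilla expert parameters")
```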
Where Pith is reading between the lines
- Expert specialization appears less layer-specific than standard designs assume, since deeper layers tolerate shared capacity without large accuracy loss.
- Memory for expert weights can be stored once rather than replicated per layer, lowering the overall parameter footprint.
- The architecture supports greater model depth without forcing proportional growth in expert compute or memory.
Load-bearing premise
A single pool-level auxiliary loss plus NormRouter is sufficient to maintain balanced expert utilization and stable gradients when the same experts are accessed by routers at every depth.
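A plausible, assumed form of that auxiliary loss follows the standard Switch-style balance term but pools routing statistics from every layer before computing it, so balance is enforced over the shared pool as a whole rather than per layer. The sketch below is a guess at the construction, not the paper's equation.

```python
import torch
import torch.nn.functional as F


def pool_level_aux_loss(all_router_logits, k: int) -> torch.Tensor:
    """all_router_logits: one (tokens, pool_size) logit tensor per layer."""
    probs = torch.cat([F.softmax(lg, dim=-1) for lg in all_router_logits])
    pool_size = probs.shape[-1]

    top_idx = probs.topk(k, dim=-1).indices             # pooled top-k choices
    dispatch_frac = torch.bincount(
        top_idx.flatten(), minlength=pool_size).float() / top_idx.numel()
    mean_prob = probs.mean(dim=0)                       # pooled router probs

    # Switch-style balance term: minimized when both vectors are uniform.
    # dispatch_frac is non-differentiable; gradient flows through mean_prob.
    return pool_size * torch.dot(dispatch_frac, mean_prob)
```

The summary gives no formula for NormRouter; one common scale-stable construction, offered purely as an assumption, is cosine routing between normalized tokens and normalized expert embeddings:

```python
def norm_router_logits(x, weight, tau: float = 10.0):
    """Logits bounded in [-tau, tau] regardless of the scale of x or weight,
    keeping softmax sharpness stable across depths that share one pool."""
    return tau * F.normalize(x, dim=-1) @ F.normalize(weight, dim=-1).t()
```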
What would settle it
Training a UniPool model at any of the tested scales with a reduced pool size and finding that its validation loss and perplexity exceed those of the matched vanilla MoE baseline would falsify the sublinear-scaling claim.
Original abstract
Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
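The probe that motivates the design is easy to restate in code. A minimal sketch, assuming a standard top-k MoE interface in which a layer's router can be swapped out (the function and its caller are hypothetical): replace one deep layer's learned routing with the stand-in below and re-run evaluation; a drop of only 1.0-1.6 points suggests that layer's learned routing carries little information the shared experts cannot absorb.

```python
import torch


def uniform_random_topk(num_tokens: int, pool_size: int, k: int, device=None):
    """Stand-in for a learned router: k distinct experts per token, chosen
    uniformly at random, with equal gate weights."""
    idx = torch.rand(num_tokens, pool_size, device=device).topk(k, dim=-1).indices
    gates = torch.full((num_tokens, k), 1.0 / k, device=device)
    return idx, gates
```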
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UniPool, replacing per-layer expert sets in MoE with a single globally shared expert pool accessed by independent per-layer routers. Motivated by a routing probe showing that random routing in deeper layers causes only minor accuracy drops, it introduces a pool-level auxiliary loss for utilization balance and adopts NormRouter for stable sparse routing. On five LLaMA scales (182M–978M parameters) trained on 30B Pile tokens, UniPool yields consistent validation loss reductions (up to 0.0386) over vanilla MoE baselines; reduced-pool variants using 41.6–66.7% of the expert-parameter budget match or exceed layer-wise MoE performance, indicating sublinear expert scaling with depth.
Significance. If the empirical results hold under scrutiny, the work is significant for demonstrating that expert capacity need not scale linearly with depth in MoE architectures, potentially improving parameter efficiency in large models. The multi-scale validation and the framing of pool size as an explicit hyperparameter are strengths, as is the note that benefits compose with finer-grained expert decomposition. The approach challenges a core convention in MoE design with concrete efficiency gains.
major comments (2)
- [Methods (auxiliary loss and NormRouter)] The pool-level auxiliary loss (described in the methods) equalizes aggregate expert counts but provides no mechanism or analysis to ensure per-layer router decisions produce non-conflicting activation patterns or stable gradients across depths when experts are fully shared. This assumption is load-bearing for the claim that reduced-pool UniPool remains stable and effective.
- [Experiments and Results] The reported loss reductions and reduced-pool matches lack error bars, training curves, or ablations on the auxiliary-loss coefficient and pool-size selection procedure. Without these, it is difficult to determine whether the consistent gains across the five scales are robust or sensitive to hyperparameter choices.
minor comments (2)
- [Introduction] The routing-probe result (1.0–1.6 point drop) is cited as motivation but would benefit from a brief table or figure showing the exact models and tasks used.
- [Abstract and Experiments] Notation for the shared pool size versus per-layer expert count could be made more explicit when comparing parameter budgets.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense based on our work while noting where revisions will strengthen the presentation.
Point-by-point responses
- Referee: [Methods (auxiliary loss and NormRouter)] The pool-level auxiliary loss (described in the methods) equalizes aggregate expert counts but provides no mechanism or analysis to ensure per-layer router decisions produce non-conflicting activation patterns or stable gradients across depths when experts are fully shared. This assumption is load-bearing for the claim that reduced-pool UniPool remains stable and effective.
  Authors: The pool-level auxiliary loss balances aggregate utilization across the shared pool, and NormRouter is adopted specifically to stabilize sparse routing at scale. The routing probe in the introduction provides supporting evidence that deeper layers exhibit substantial redundancy, with random routing causing only minor accuracy drops, which motivates the feasibility of shared experts. Our results across five scales show consistent training stability and performance gains with reduced pools, indicating that independent per-layer routers learn compatible patterns in practice. To directly address the request for mechanism analysis, we will add a new subsection with per-layer activation correlation statistics in the revised manuscript; one candidate statistic is sketched after these responses. revision: partial
- Referee: [Experiments and Results] The reported loss reductions and reduced-pool matches lack error bars, training curves, or ablations on the auxiliary-loss coefficient and pool-size selection procedure. Without these, it is difficult to determine whether the consistent gains across the five scales are robust or sensitive to hyperparameter choices.
  Authors: We agree that error bars, curves, and targeted ablations would clarify robustness. The consistency of gains across five distinct scales already serves as a form of multi-run validation, but we will revise the results section and appendix to include error bars from repeated seeds where compute permits, representative training curves, and ablations on the auxiliary-loss coefficient as well as pool-size selection heuristics. revision: yes
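One candidate for the promised per-layer activation statistics, sketched under the assumption that each layer's top-k indices into the shared pool are logged during evaluation (the function is illustrative, not the authors'):

```python
import torch


def layerwise_usage_correlation(top_idx_per_layer, pool_size: int) -> torch.Tensor:
    """top_idx_per_layer: list of (tokens, k) expert-index tensors, one per
    layer. Returns an (L, L) correlation matrix of pool-usage histograms;
    large off-diagonal entries mean different depths reuse the same experts."""
    hists = [torch.bincount(idx.flatten(), minlength=pool_size).float()
             for idx in top_idx_per_layer]
    usage = torch.stack([h / h.sum() for h in hists])   # (L, pool_size)
    return torch.corrcoef(usage)
```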
Circularity Check
No significant circularity in empirical performance claims
full rationale
The paper presents an empirical architecture (UniPool) and reports measured validation loss/perplexity improvements across five model scales trained on 30B tokens. No derivation chain, first-principles result, or prediction is claimed that reduces by the paper's own equations to fitted inputs or self-citations. The routing probe is an experiment whose outcome is reported rather than used to derive the final metrics. The auxiliary loss and NormRouter are design choices whose effects are validated by direct training runs, not by construction. This is the common case of a self-contained empirical paper.
Axiom & Free-Parameter Ledger
free parameters (2)
- pool size
- auxiliary loss coefficient
axioms (1)
- domain assumption: Transformer layers can be trained with standard optimizers and token-level cross-entropy when experts are shared across depth.
invented entities (2)
- UniPool shared expert pool (no independent evidence)
- NormRouter (no independent evidence)