pith. machine review for the scientific record.

arxiv: 2605.06665 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 11:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mixture of experts · shared expert pool · parameter efficiency · depth scaling · LLaMA models · auxiliary loss · routing stability · validation loss

The pith

A single shared expert pool improves MoE loss and allows expert parameters to grow sublinearly with depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard Mixture-of-Experts models assign each transformer layer its own dedicated experts, so expert parameters increase linearly as models get deeper. This paper replaces that rule with UniPool, a design that keeps one global pool of experts and lets every layer's router pick from the same pool. A pool-level auxiliary loss balances how often each expert is used, while NormRouter keeps the sparse selections stable across layers. Experiments on five LLaMA-style models from 182M to 978M parameters trained on 30B tokens show consistent drops in validation loss compared with conventional per-layer MoE. Even when the shared pool is shrunk to 41.6–66.7% of the usual expert budget, performance matches or exceeds the baseline, demonstrating that expert capacity need not scale linearly with depth.

Core claim

UniPool replaces per-layer expert ownership with a single shared expert pool accessed by independent per-layer routers. A pool-level auxiliary loss maintains balanced utilization across the entire pool, and NormRouter supplies scale-stable sparse routing. Across five LLaMA-architecture scales trained on 30B tokens from the Pile, this yields up to 0.0386 lower validation loss than matched vanilla MoE. Reduced-pool versions using only 41.6–66.7% of the vanilla expert-parameter budget still match or outperform layer-wise MoE, showing that expert parameters can grow sublinearly with depth while remaining more efficient and effective than vanilla MoE.
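
The 41.6–66.7% figures invite a quick sanity check. The sketch below uses purely illustrative layer counts, expert counts, expert sizes, and pool sizes that are not taken from the paper; it only shows, under the assumption that all experts are the same size, why the vanilla budget grows with depth times experts-per-layer while a shared-pool budget depends only on pool size.

```python
# Hypothetical illustration of the expert-parameter budget comparison. The layer
# count, experts-per-layer, expert size, and pool sizes below are NOT taken from
# the paper; they only show why a reduced shared pool can sit at a fraction of
# the vanilla per-layer budget.


def vanilla_expert_params(num_layers: int, experts_per_layer: int, params_per_expert: int) -> int:
    """Per-layer MoE: every layer owns its own expert set, so the budget grows with depth."""
    return num_layers * experts_per_layer * params_per_expert


def shared_pool_expert_params(pool_size: int, params_per_expert: int) -> int:
    """UniPool-style design: one global pool, so the budget depends only on pool size."""
    return pool_size * params_per_expert


if __name__ == "__main__":
    num_layers, experts_per_layer, params_per_expert = 12, 8, 4_000_000  # illustrative only
    vanilla = vanilla_expert_params(num_layers, experts_per_layer, params_per_expert)
    for pool_size in (40, 64, 96):  # hypothetical reduced and full pools
        shared = shared_pool_expert_params(pool_size, params_per_expert)
        print(f"pool={pool_size}: {shared / vanilla:.1%} of the vanilla expert budget")
```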

What carries the argument

A globally shared expert pool accessed by per-layer routers, stabilized by a pool-level auxiliary loss and NormRouter.
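
The paper's module definitions are not reproduced here; the PyTorch sketch below is one plausible rendering of the core mechanism, with a single expert pool instantiated once and independent per-layer routers indexing into it. The class names (SharedExpertPool, LayerRouter) are invented for illustration, and the plain top-k softmax router stands in for the paper's NormRouter, whose exact formulation is not given on this page.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedExpertPool(nn.Module):
    """A single global set of expert MLPs, instantiated once for the whole model (illustrative)."""

    def __init__(self, pool_size: int, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(pool_size)
        )


class LayerRouter(nn.Module):
    """Per-layer router over the shared pool; a plain top-k softmax, not the paper's NormRouter."""

    def __init__(self, d_model: int, pool_size: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, pool_size, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)        # (tokens, pool_size)
        weights, idx = probs.topk(self.top_k, dim=-1)  # top-k gate weights and expert indices
        return weights, idx, probs                     # probs kept for the balancing loss


def moe_forward(x, router: LayerRouter, pool: SharedExpertPool):
    """Send each token through its top-k experts, all drawn from the one shared pool."""
    weights, idx, probs = router(x)
    out = torch.zeros_like(x)
    for k in range(idx.shape[-1]):                     # loop over the k routed slots
        for e, expert in enumerate(pool.experts):      # dense loop for clarity, not speed
            mask = idx[:, k] == e
            if mask.any():
                out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
    return out, probs
```

In a full model, each transformer layer would own its own LayerRouter while every layer calls moe_forward with the same SharedExpertPool instance; that single shared instance is the architectural change relative to vanilla MoE, where each layer constructs its own expert set.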

If this is right

  • Validation loss and perplexity improve over vanilla MoE at every tested scale from 182M to 978M parameters.
  • Expert parameters can scale sublinearly with depth while still matching or beating layer-wise allocation.
  • Pool size becomes an explicit, tunable hyperparameter for depth scaling rather than being fixed by layer count.
  • UniPool gains compose with finer-grained expert decomposition techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Expert specialization appears less layer-specific than standard designs assume, since deeper layers tolerate shared capacity without large accuracy loss.
  • Memory for expert weights can be stored once rather than replicated per layer, lowering the overall parameter footprint.
  • The architecture supports greater model depth without forcing proportional growth in expert compute or memory.

Load-bearing premise

A single pool-level auxiliary loss plus NormRouter is sufficient to maintain balanced expert utilization and stable gradients when the same experts are accessed by routers at every depth.
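
The exact form of the pool-level auxiliary loss is not given above. A minimal sketch, assuming a Switch-Transformer-style load-balancing term computed once over routing statistics pooled from every layer (rather than once per layer), is shown below; the function name and the scaling constant are illustrative assumptions.

```python
import torch


def pool_level_balance_loss(all_layer_probs, all_layer_indices, pool_size: int):
    """Sketch of a pool-level balancing loss (assumed Switch-style, aggregated over layers).

    all_layer_probs:   list of (tokens, pool_size) router softmax outputs, one per layer
    all_layer_indices: list of (tokens, top_k) selected expert indices, one per layer
    """
    probs = torch.cat(all_layer_probs, dim=0)  # pool routing statistics from every layer
    idx = torch.cat(all_layer_indices, dim=0)

    # Fraction of all routed assignments that land on each expert, over the whole pool.
    counts = torch.zeros(pool_size, device=probs.device)
    counts.scatter_add_(0, idx.reshape(-1), torch.ones(idx.numel(), device=probs.device))
    load = counts / counts.sum()

    # Mean router probability assigned to each expert, again over the whole pool.
    importance = probs.mean(dim=0)

    # Switch-style dot product, scaled so a perfectly uniform pool gives a loss near 1.
    return pool_size * torch.dot(load, importance)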

What would settle it

Training a reduced-pool UniPool model at any of the tested scales and finding that its validation loss and perplexity come out worse (higher) than those of the matched vanilla MoE baseline would falsify the sublinear-scaling claim.

Figures

Figures reproduced from arXiv: 2605.06665 by Chuanyang Zheng, Guoxuan Chen, Han Shi, Hong Cheng, Minbin Huang, Xintong Yu, Yichun Yin, Yimeng Wu.

Figure 1
Figure 1: UNIPOOL overview. Vanilla MoE allocates a private expert set to each transformer layer, tying expert parameters to depth and preventing cross-layer reuse. UNIPOOL replaces layer-private ownership with a single global expert pool while keeping independent per-layer routers. Pool-level balancing aggregates utilization over the shared pool, preventing globally unused experts without forcing every layer to use…
Figure 3
Figure 3: Expert utilization at the 182M scale: per-layer auxiliary loss leads to global expert collapse…
Figure 4
Figure 4: Validation loss curves. Panels (a)–(c) compare UNIPOOL with vanilla MoE at 182M, 469M, and 650M over 30B Pile tokens. Panel (d) shows the 182M sharing-scope ablation, where G=1 is full UNIPOOL and G=12 is vanilla MoE; grouped configurations interpolate between the endpoints.
read the original abstract

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
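
The routing probe behind the redundancy claim is described only at a high level in the abstract. A minimal sketch of what such a probe could look like, assuming it simply swaps one deeper layer's learned top-k router for uniform random routing before re-measuring downstream accuracy, is given below; the class name, interface, and any evaluation harness are assumptions, not the paper's code.

```python
import torch


def random_routing(num_tokens: int, pool_size: int, top_k: int, device=None):
    """Uniform random top-k routing: experts drawn uniformly, equal gate weights."""
    idx = torch.stack(
        [torch.randperm(pool_size, device=device)[:top_k] for _ in range(num_tokens)]
    )
    weights = torch.full((num_tokens, top_k), 1.0 / top_k, device=device)
    return weights, idx


class RandomRouterProbe(torch.nn.Module):
    """Drop-in replacement for one layer's learned router that routes uniformly at random.

    Mirrors the spirit of the probe described in the abstract (swap a deeper layer's
    learned top-k router for random routing, then re-measure downstream accuracy);
    the class name and interface are invented for illustration.
    """

    def __init__(self, pool_size: int, top_k: int):
        super().__init__()
        self.pool_size, self.top_k = pool_size, top_k

    def forward(self, x):  # x: (tokens, d_model); the input is deliberately ignored
        weights, idx = random_routing(x.shape[0], self.pool_size, self.top_k, x.device)
        probs = torch.full((x.shape[0], self.pool_size), 1.0 / self.pool_size, device=x.device)
        return weights, idx, probs
```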

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes UniPool, replacing per-layer expert sets in MoE with a single globally shared expert pool accessed by independent per-layer routers. Motivated by a routing probe showing that random routing in deeper layers causes only minor accuracy drops, it introduces a pool-level auxiliary loss for utilization balance and adopts NormRouter for stable sparse routing. On five LLaMA scales (182M–978M parameters) trained on 30B Pile tokens, UniPool yields consistent validation loss reductions (up to 0.0386) over vanilla MoE baselines; reduced-pool variants using 41.6–66.7% of the expert-parameter budget match or exceed layer-wise MoE performance, indicating sublinear expert scaling with depth.

Significance. If the empirical results hold under scrutiny, the work is significant for demonstrating that expert capacity need not scale linearly with depth in MoE architectures, potentially improving parameter efficiency in large models. The multi-scale validation and the framing of pool size as an explicit hyperparameter are strengths, as is the note that benefits compose with finer-grained expert decomposition. The approach challenges a core convention in MoE design with concrete efficiency gains.

major comments (2)
  1. [Methods (auxiliary loss and NormRouter)] The pool-level auxiliary loss (described in the methods) equalizes aggregate expert counts but provides no mechanism or analysis to ensure per-layer router decisions produce non-conflicting activation patterns or stable gradients across depths when experts are fully shared. This assumption is load-bearing for the claim that reduced-pool UniPool remains stable and effective.
  2. [Experiments and Results] Results section and abstract: the reported loss reductions and reduced-pool matches lack error bars, training curves, or ablations on the auxiliary-loss coefficient and pool-size selection procedure. Without these, it is difficult to determine whether the consistent gains across the five scales are robust or sensitive to hyperparameter choices.
minor comments (2)
  1. [Introduction] The routing-probe result (1.0–1.6 point drop) is cited as motivation but would benefit from a brief table or figure showing the exact models and tasks used.
  2. [Abstract and Experiments] Notation for the shared pool size versus per-layer expert count could be made more explicit when comparing parameter budgets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense based on our work while noting where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Methods (auxiliary loss and NormRouter)] The pool-level auxiliary loss (described in the methods) equalizes aggregate expert counts but provides no mechanism or analysis to ensure per-layer router decisions produce non-conflicting activation patterns or stable gradients across depths when experts are fully shared. This assumption is load-bearing for the claim that reduced-pool UniPool remains stable and effective.

    Authors: The pool-level auxiliary loss balances aggregate utilization across the shared pool, and NormRouter is adopted specifically to stabilize sparse routing at scale. The routing probe in the introduction provides supporting evidence that deeper layers exhibit substantial redundancy, with random routing causing only minor drops, which motivates the feasibility of shared experts. Our results across five scales show consistent training stability and performance gains with reduced pools, indicating that independent per-layer routers learn compatible patterns in practice. To directly address the request for mechanism analysis, we will add a new subsection with per-layer activation correlation statistics in the revised manuscript. revision: partial

  2. Referee: [Experiments and Results] Results section and abstract: the reported loss reductions and reduced-pool matches lack error bars, training curves, or ablations on the auxiliary-loss coefficient and pool-size selection procedure. Without these, it is difficult to determine whether the consistent gains across the five scales are robust or sensitive to hyperparameter choices.

    Authors: We agree that error bars, curves, and targeted ablations would improve clarity on robustness. The consistency of gains across five distinct scales already serves as a form of multi-run validation, but we will revise the results section and appendix to include error bars from repeated seeds where compute permits, representative training curves, and ablations on the auxiliary loss coefficient as well as pool-size selection heuristics. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical performance claims

full rationale

The paper presents an empirical architecture (UniPool) and reports measured validation loss/perplexity improvements across five model scales trained on 30B tokens. No derivation chain, first-principles result, or prediction is claimed that reduces by the paper's own equations to fitted inputs or self-citations. The routing probe is an experiment whose outcome is reported rather than used to derive the final metrics. The auxiliary loss and NormRouter are design choices whose effects are validated by direct training runs, not by construction. This is the common case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The work rests on standard transformer and MoE training assumptions plus two new design choices (global pool and pool-level loss) whose stability is asserted rather than derived.

free parameters (2)
  • pool size
    Explicitly varied as a depth-scaling hyperparameter; reduced-pool variants are compared to full vanilla budgets.
  • auxiliary loss coefficient
    Weight of the pool-level balancing loss; required for stable training but value not stated in abstract.
axioms (1)
  • domain assumption: Transformer layers can be trained with standard optimizers and token-level cross-entropy when experts are shared across depth.
    Implicit in all reported training runs.
invented entities (2)
  • UniPool shared expert pool · no independent evidence
    purpose: Single global set of experts accessed by independent per-layer routers.
    Core architectural change; no independent evidence outside the reported training runs.
  • NormRouter · no independent evidence
    purpose: Sparse and scale-stable routing into the shared pool.
    Adopted to enable stable training; no external validation supplied.

pith-pipeline@v0.9.0 · 5660 in / 1539 out tokens · 54422 ms · 2026-05-08T11:51:46.361585+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 38 canonical work pages · 20 internal anchors

  1. [1]

    Diep: Adaptive mixture-of-experts compression through differentiable expert pruning.arXiv preprint arXiv:2509.16105, 2025

    Sikai Bai, Haoxi Li, Jie Zhang, Zicong Hong, and Song Guo. Diep: Adaptive mixture-of-experts compression through differentiable expert pruning.arXiv preprint arXiv:2509.16105, 2025

  2. [2]

    PIQA: Reasoning about Physical Commonsense in Natural Language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020

  3. [3]

    Unified scaling laws for routed language models

    Aidan Clark, Diego de las Casas, Aurelia Guy, Arthur Mensch, et al. Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169, 2022

  4. [4]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2924–2936, 2019

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  6. [6]

    MoEUT: Mixture-of-Experts Universal Transformers

    Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, and Christopher D. Manning. MoEUT: Mixture-of-experts universal transformers. In Advances in Neural Information Processing Systems, 2024

  7. [7]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wenge Zeng, Xingkai Yu, Y . Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066, 2024

  8. [8]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

  9. [9]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024

  10. [10]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019

  11. [11]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

  12. [12]

    Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.Biological cybernetics, 36(4):193–202, 1980

    Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.Biological cybernetics, 36(4):193–202, 1980

  13. [13]

    Visual feature extraction by a multilayered network of analog threshold elements.IEEE Transactions on Systems Science and Cybernetics, 5(4):322–333, 2007

    Kunihiko Fukushima. Visual feature extraction by a multilayered network of analog threshold elements.IEEE Transactions on Systems Science and Cybernetics, 5(4):322–333, 2007

  14. [14]

    MegaBlocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5, 2023

    Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. MegaBlocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5, 2023

  15. [15]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

  16. [16]

    Transformer Feed-Forward Layers Are Key-Value Memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories.arXiv preprint arXiv:2012.14913, 2020

  17. [17]

    Mixture of a Million Experts

    Xu Owen He. Mixture of a million experts.arXiv preprint arXiv:2407.04153, 2024

  18. [18]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016

  19. [19]

    A theory of steady-state activity in nerve-fiber networks: I

    Alston S Householder. A theory of steady-state activity in nerve-fiber networks: I. definitions and preliminary lemmas.The bulletin of mathematical biophysics, 3(2):63–69, 1941

  20. [20]

    Sd-moe: Spectral decomposition for effective expert specialization

    Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, et al. Sd-moe: Spectral decomposition for effective expert specialization.arXiv preprint arXiv:2602.12556, 2026

  21. [21]

    Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5, 2023

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Parijat Ram, et al. Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5, 2023

  22. [22]

    Adaptive mixtures of local experts

    Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. doi: 10.1162/neco.1991.3.1.79

  23. [23]

    Mixtral of Experts

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  24. [24]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  25. [25]

    Scaling laws for fine-grained mixture of experts

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Piontkowski, Piotr Piotrowski, Szymon Antoniak, et al. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024

  26. [26]

    RACE: Large-scale reading comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, 2017

  27. [27]

    ALBERT: A lite BERT for self-supervised learning of language representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020

  28. [28]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with condi- tional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2021

  29. [29]

    BASE layers: Simplifying training of large, sparse models

    Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. BASE layers: Simplifying training of large, sparse models. InInternational Conference on Machine Learning, 2021

  30. [30]

    Towards a unified view of sparse feed-forward network in pretraining large language model

    Zeyu Liu, Tim Dettmers, Xi Lin, Veselin Stoyanov, and Xian Li. Towards a unified view of sparse feed-forward network in pretraining large language model. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15038–15061, Singapore, December 2023. Association for ...

  31. [31]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2019

  32. [32]

    Olmoe: Open mixture-of-experts language models

    Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Peter Izsak, et al. OLMoE: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060, 2024

  33. [33]

    On estimating regression.Theory of Probability & Its Applications, 9(1): 141–142, 1964

    Elizbar A Nadaraya. On estimating regression.Theory of Probability & Its Applications, 9(1): 141–142, 1964

  34. [34]

    Rectified linear units improve restricted Boltzmann machines

    Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814. Omnipress, 2010

  35. [35]

    On least square estimation in softmax gating mixture of experts.arXiv preprint arXiv:2402.02952, 2024

    Huy Nguyen, Nhat Ho, and Alessandro Rinaldo. On least square estimation in softmax gating mixture of experts.arXiv preprint arXiv:2402.02952, 2024

  36. [36]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Marco Baroni, et al. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1525–1534, 2016

  37. [37]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  38. [38]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher.arXiv preprint arXiv:2112.11446, 2021

  39. [39]

    DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. arXiv preprint arXiv:2201.05596, 2022

  40. [40]

    Hash layers for large sparse models.Advances in Neural Information Processing Systems, 34, 2021

    Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. Hash layers for large sparse models.Advances in Neural Information Processing Systems, 34, 2021

  41. [41]

    Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

  42. [42]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  43. [43]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  44. [45]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2020

  45. [46]

    Dolma: an open corpus of three trillion tokens for language model pretraining research

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Ri...

  46. [47]

    RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  47. [48]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  48. [49]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  49. [50]

    Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

  50. [51]

    Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664,

    Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664, 2024

  51. [52]

    Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964

    Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964

  52. [53]

    Crowdsourcing multiple choice science questions

    Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin, editors,Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4413. URLhttps:/...

  53. [54]

    Sere: Similarity-based expert re-routing for efficient batch decoding in moe models.arXiv preprint arXiv:2602.07616, 2026

    Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, and Li Yuan. Sere: Similarity-based expert re-routing for efficient batch decoding in moe models.arXiv preprint arXiv:2602.07616, 2026

  54. [55]

    Sheared LLaMA: Accelerating language model pre-training via structured pruning

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=09iOdaeOzp

  55. [56]

    OpenMoE: An early effort on open mixture-of-experts language models.arXiv preprint arXiv:2402.01739, 2024

    Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. OpenMoE: An early effort on open mixture-of-experts language models.arXiv preprint arXiv:2402.01739, 2024

  56. [57]

    HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

  57. [58]

    Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

  58. [59]

    TinyLlama: An Open-Source Small Language Model

    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model.arXiv preprint arXiv:2401.02385, 2024

  59. [60]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022

  60. [61]

    Calibrate before use: Improving few-shot performance of language models

    Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. InInternational conference on machine learning, pages 12697–12706. PMLR, 2021

  61. [62]

    Understanding the mixture-of-experts with nadaraya-watson kernel.arXiv preprint arXiv:2509.25913, 2025

    Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, et al. Understanding the mixture-of-experts with nadaraya-watson kernel.arXiv preprint arXiv:2509.25913, 2025

  62. [63]

    Understanding transformer from the perspective of associative memory.arXiv preprint arXiv:2505.19488, 2025

    Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory.arXiv preprint arXiv:2505.19488, 2025

  63. [64]

    Mixture-of-experts with expert choice routing

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, and James Laudon. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35, 2022

  64. [65]

    LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training

    Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15913–15923, Miami, Florida, USA, N...

  65. [66]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022