pith. machine review for the scientific record.

arxiv: 2605.06665 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 11:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mixture of experts · shared expert pool · parameter efficiency · depth scaling · LLaMA models · auxiliary loss · routing stability · validation loss

The pith

A single shared expert pool improves MoE loss and allows expert parameters to grow sublinearly with depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard Mixture-of-Experts models assign each transformer layer its own dedicated experts, so expert parameters increase linearly as models get deeper. This paper replaces that rule with UniPool, a design that keeps one global pool of experts and lets every layer's router pick from the same pool. A pool-level auxiliary loss balances how often each expert is used, while NormRouter keeps the sparse selections stable across layers. Experiments on five LLaMA-style models from 182M to 978M parameters trained on 30B tokens show consistent drops in validation loss compared with conventional per-layer MoE. Even when the shared pool is shrunk to 41.6–66.7% of the usual expert budget, performance matches or exceeds the baseline, demonstrating that expert capacity need not scale linearly with depth.

Core claim

UniPool replaces per-layer expert ownership with a single shared expert pool accessed by independent per-layer routers. A pool-level auxiliary loss maintains balanced utilization across the entire pool, and NormRouter supplies scale-stable sparse routing. Across five LLaMA-architecture scales trained on 30B tokens from the Pile, this yields up to 0.0386 lower validation loss than matched vanilla MoE. Reduced-pool versions using only 41.6–66.7% of the vanilla expert-parameter budget still match or outperform layer-wise MoE, showing that expert parameters can grow sublinearly with depth while remaining more efficient and effective than vanilla MoE.
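
The 41.6–66.7% figures invite a quick sanity check. The sketch below uses purely illustrative layer counts, expert counts, expert sizes, and pool sizes that are not taken from the paper; it only shows, under the assumption that all experts are the same size, why the vanilla budget grows with depth times experts-per-layer while a shared-pool budget depends only on pool size.

```python
# Hypothetical illustration of the expert-parameter budget comparison. The layer
# count, experts-per-layer, expert size, and pool sizes below are NOT taken from
# the paper; they only show why a reduced shared pool can sit at a fraction of
# the vanilla per-layer budget.


def vanilla_expert_params(num_layers: int, experts_per_layer: int, params_per_expert: int) -> int:
    """Per-layer MoE: every layer owns its own expert set, so the budget grows with depth."""
    return num_layers * experts_per_layer * params_per_expert


def shared_pool_expert_params(pool_size: int, params_per_expert: int) -> int:
    """UniPool-style design: one global pool, so the budget depends only on pool size."""
    return pool_size * params_per_expert


if __name__ == "__main__":
    num_layers, experts_per_layer, params_per_expert = 12, 8, 4_000_000  # illustrative only
    vanilla = vanilla_expert_params(num_layers, experts_per_layer, params_per_expert)
    for pool_size in (40, 64, 96):  # hypothetical reduced and full pools
        shared = shared_pool_expert_params(pool_size, params_per_expert)
        print(f"pool={pool_size}: {shared / vanilla:.1%} of the vanilla expert budget")
```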

What carries the argument

A globally shared expert pool accessed by per-layer routers, stabilized by a pool-level auxiliary loss and NormRouter.
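
The paper's module definitions are not reproduced here; the PyTorch sketch below is one plausible rendering of the core mechanism, with a single expert pool instantiated once and independent per-layer routers indexing into it. The class names (SharedExpertPool, LayerRouter) are invented for illustration, and the plain top-k softmax router stands in for the paper's NormRouter, whose exact formulation is not given on this page.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedExpertPool(nn.Module):
    """A single global set of expert MLPs, instantiated once for the whole model (illustrative)."""

    def __init__(self, pool_size: int, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(pool_size)
        )


class LayerRouter(nn.Module):
    """Per-layer router over the shared pool; a plain top-k softmax, not the paper's NormRouter."""

    def __init__(self, d_model: int, pool_size: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, pool_size, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)        # (tokens, pool_size)
        weights, idx = probs.topk(self.top_k, dim=-1)  # top-k gate weights and expert indices
        return weights, idx, probs                     # probs kept for the balancing loss


def moe_forward(x, router: LayerRouter, pool: SharedExpertPool):
    """Send each token through its top-k experts, all drawn from the one shared pool."""
    weights, idx, probs = router(x)
    out = torch.zeros_like(x)
    for k in range(idx.shape[-1]):                     # loop over the k routed slots
        for e, expert in enumerate(pool.experts):      # dense loop for clarity, not speed
            mask = idx[:, k] == e
            if mask.any():
                out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
    return out, probs
```

In a full model, each transformer layer would own its own LayerRouter while every layer calls moe_forward with the same SharedExpertPool instance; that single shared instance is the architectural change relative to vanilla MoE, where each layer constructs its own expert set.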

If this is right

  • Validation loss and perplexity improve over vanilla MoE at every tested scale from 182M to 978M parameters.
  • Expert parameters can scale sublinearly with depth while still matching or beating layer-wise allocation.
  • Pool size becomes an explicit, tunable hyperparameter for depth scaling rather than being fixed by layer count.
  • UniPool gains compose with finer-grained expert decomposition techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Expert specialization appears less layer-specific than standard designs assume, since deeper layers tolerate shared capacity without large accuracy loss.
  • Memory for expert weights can be stored once rather than replicated per layer, lowering the overall parameter footprint.
  • The architecture supports greater model depth without forcing proportional growth in expert compute or memory.

Load-bearing premise

A single pool-level auxiliary loss plus NormRouter is sufficient to maintain balanced expert utilization and stable gradients when the same experts are accessed by routers at every depth.
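
The exact form of the pool-level auxiliary loss is not given above. A minimal sketch, assuming a Switch-Transformer-style load-balancing term computed once over routing statistics pooled from every layer (rather than once per layer), is shown below; the function name and the scaling constant are illustrative assumptions.

```python
import torch


def pool_level_balance_loss(all_layer_probs, all_layer_indices, pool_size: int):
    """Sketch of a pool-level balancing loss (assumed Switch-style, aggregated over layers).

    all_layer_probs:   list of (tokens, pool_size) router softmax outputs, one per layer
    all_layer_indices: list of (tokens, top_k) selected expert indices, one per layer
    """
    probs = torch.cat(all_layer_probs, dim=0)  # pool routing statistics from every layer
    idx = torch.cat(all_layer_indices, dim=0)

    # Fraction of all routed assignments that land on each expert, over the whole pool.
    counts = torch.zeros(pool_size, device=probs.device)
    counts.scatter_add_(0, idx.reshape(-1), torch.ones(idx.numel(), device=probs.device))
    load = counts / counts.sum()

    # Mean router probability assigned to each expert, again over the whole pool.
    importance = probs.mean(dim=0)

    # Switch-style dot product, scaled so a perfectly uniform pool gives a loss near 1.
    return pool_size * torch.dot(load, importance)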

What would settle it

Training a reduced-pool UniPool model at any of the tested scales and finding that its validation loss and perplexity come out worse (higher) than those of the matched vanilla MoE baseline would falsify the sublinear-scaling claim.

Figures

Figures reproduced from arXiv: 2605.06665 by Chuanyang Zheng, Guoxuan Chen, Han Shi, Hong Cheng, Minbin Huang, Xintong Yu, Yichun Yin, Yimeng Wu.

Figure 1
Figure 1: UNIPOOL overview. Vanilla MoE allocates a private expert set to each transformer layer, tying expert parameters to depth and preventing cross-layer reuse. UNIPOOL replaces layer-private ownership with a single global expert pool while keeping independent per-layer routers. Pool-level balancing aggregates utilization over the shared pool, preventing globally unused experts without forcing every layer to use…
Figure 3
Figure 3: Expert utilization at the 182M scale: per-layer auxiliary loss leads to global expert collapse…
Figure 4
Figure 4: Validation loss curves. Panels (a)–(c) compare UNIPOOL with vanilla MoE at 182M, 469M, and 650M over 30B Pile tokens. Panel (d) shows the 182M sharing-scope ablation, where G=1 is full UNIPOOL and G=12 is vanilla MoE; grouped configurations interpolate between the endpoints.
read the original abstract

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
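
The routing probe behind the redundancy claim is described only at a high level in the abstract. A minimal sketch of what such a probe could look like, assuming it simply swaps one deeper layer's learned top-k router for uniform random routing before re-measuring downstream accuracy, is given below; the class name, interface, and any evaluation harness are assumptions, not the paper's code.

```python
import torch


def random_routing(num_tokens: int, pool_size: int, top_k: int, device=None):
    """Uniform random top-k routing: experts drawn uniformly, equal gate weights."""
    idx = torch.stack(
        [torch.randperm(pool_size, device=device)[:top_k] for _ in range(num_tokens)]
    )
    weights = torch.full((num_tokens, top_k), 1.0 / top_k, device=device)
    return weights, idx


class RandomRouterProbe(torch.nn.Module):
    """Drop-in replacement for one layer's learned router that routes uniformly at random.

    Mirrors the spirit of the probe described in the abstract (swap a deeper layer's
    learned top-k router for random routing, then re-measure downstream accuracy);
    the class name and interface are invented for illustration.
    """

    def __init__(self, pool_size: int, top_k: int):
        super().__init__()
        self.pool_size, self.top_k = pool_size, top_k

    def forward(self, x):  # x: (tokens, d_model); the input is deliberately ignored
        weights, idx = random_routing(x.shape[0], self.pool_size, self.top_k, x.device)
        probs = torch.full((x.shape[0], self.pool_size), 1.0 / self.pool_size, device=x.device)
        return weights, idx, probs
```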

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes UniPool, replacing per-layer expert sets in MoE with a single globally shared expert pool accessed by independent per-layer routers. Motivated by a routing probe showing that random routing in deeper layers causes only minor accuracy drops, it introduces a pool-level auxiliary loss for utilization balance and adopts NormRouter for stable sparse routing. On five LLaMA scales (182M–978M parameters) trained on 30B Pile tokens, UniPool yields consistent validation loss reductions (up to 0.0386) over vanilla MoE baselines; reduced-pool variants using 41.6–66.7% of the expert-parameter budget match or exceed layer-wise MoE performance, indicating sublinear expert scaling with depth.

Significance. If the empirical results hold under scrutiny, the work is significant for demonstrating that expert capacity need not scale linearly with depth in MoE architectures, potentially improving parameter efficiency in large models. The multi-scale validation and the framing of pool size as an explicit hyperparameter are strengths, as is the note that benefits compose with finer-grained expert decomposition. The approach challenges a core convention in MoE design with concrete efficiency gains.

major comments (2)
  1. [Methods (auxiliary loss and NormRouter)] The pool-level auxiliary loss (described in the methods) equalizes aggregate expert counts but provides no mechanism or analysis to ensure per-layer router decisions produce non-conflicting activation patterns or stable gradients across depths when experts are fully shared. This assumption is load-bearing for the claim that reduced-pool UniPool remains stable and effective.
  2. [Experiments and Results] Results section and abstract: the reported loss reductions and reduced-pool matches lack error bars, training curves, or ablations on the auxiliary-loss coefficient and pool-size selection procedure. Without these, it is difficult to determine whether the consistent gains across the five scales are robust or sensitive to hyperparameter choices.
minor comments (2)
  1. [Introduction] The routing-probe result (1.0–1.6 point drop) is cited as motivation but would benefit from a brief table or figure showing the exact models and tasks used.
  2. [Abstract and Experiments] Notation for the shared pool size versus per-layer expert count could be made more explicit when comparing parameter budgets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense based on our work while noting where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Methods (auxiliary loss and NormRouter)] The pool-level auxiliary loss (described in the methods) equalizes aggregate expert counts but provides no mechanism or analysis to ensure per-layer router decisions produce non-conflicting activation patterns or stable gradients across depths when experts are fully shared. This assumption is load-bearing for the claim that reduced-pool UniPool remains stable and effective.

    Authors: The pool-level auxiliary loss balances aggregate utilization across the shared pool, and NormRouter is adopted specifically to stabilize sparse routing at scale. The routing probe in the introduction provides supporting evidence that deeper layers exhibit substantial redundancy, with random routing causing only minor drops, which motivates the feasibility of shared experts. Our results across five scales show consistent training stability and performance gains with reduced pools, indicating that independent per-layer routers learn compatible patterns in practice. To directly address the request for mechanism analysis, we will add a new subsection with per-layer activation correlation statistics in the revised manuscript. revision: partial

  2. Referee: [Experiments and Results] Results section and abstract: the reported loss reductions and reduced-pool matches lack error bars, training curves, or ablations on the auxiliary-loss coefficient and pool-size selection procedure. Without these, it is difficult to determine whether the consistent gains across the five scales are robust or sensitive to hyperparameter choices.

    Authors: We agree that error bars, curves, and targeted ablations would improve clarity on robustness. The consistency of gains across five distinct scales already serves as a form of multi-run validation, but we will revise the results section and appendix to include error bars from repeated seeds where compute permits, representative training curves, and ablations on the auxiliary loss coefficient as well as pool-size selection heuristics. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical performance claims

full rationale

The paper presents an empirical architecture (UniPool) and reports measured validation loss/perplexity improvements across five model scales trained on 30B tokens. No derivation chain, first-principles result, or prediction is claimed that reduces by the paper's own equations to fitted inputs or self-citations. The routing probe is an experiment whose outcome is reported rather than used to derive the final metrics. The auxiliary loss and NormRouter are design choices whose effects are validated by direct training runs, not by construction. This is the common case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The work rests on standard transformer and MoE training assumptions plus two new design choices (global pool and pool-level loss) whose stability is asserted rather than derived.

free parameters (2)
  • pool size
    Explicitly varied as a depth-scaling hyperparameter; reduced-pool variants are compared to full vanilla budgets.
  • auxiliary loss coefficient
    Weight of the pool-level balancing loss; required for stable training but value not stated in abstract.
axioms (1)
  • domain assumption: Transformer layers can be trained with standard optimizers and token-level cross-entropy when experts are shared across depth.
    Implicit in all reported training runs.
invented entities (2)
  • UniPool shared expert pool · no independent evidence
    purpose: Single global set of experts accessed by independent per-layer routers.
    Core architectural change; no independent evidence outside the reported training runs.
  • NormRouter · no independent evidence
    purpose: Sparse and scale-stable routing into the shared pool.
    Adopted to enable stable training; no external validation supplied.

pith-pipeline@v0.9.0 · 5660 in / 1539 out tokens · 54422 ms · 2026-05-08T11:51:46.361585+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 38 canonical work pages · 20 internal anchors

  1. [1]

    Diep: Adaptive mixture-of-experts compression through differentiable expert pruning.arXiv preprint arXiv:2509.16105, 2025

    Sikai Bai, Haoxi Li, Jie Zhang, Zicong Hong, and Song Guo. Diep: Adaptive mixture-of-experts compression through differentiable expert pruning.arXiv preprint arXiv:2509.16105, 2025

  2. [2]

    PIQA: Reasoning about Physical Commonsense in Natural Language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020

  3. [3]

    Unified scaling laws for routed language models

    Aidan Clark, Diego de las Casas, Aurelia Guy, Arthur Mensch, et al. Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169, 2022

  4. [4]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2924–2936, 2019

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  6. [6]

    MoEUT: Mixture-of-Experts Universal Transformers

    Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, and Christopher D. Manning. MoEUT: Mixture-of-experts universal transformers. In Advances in Neural Information Processing Systems, 2024

  7. [7]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wenge Zeng, Xingkai Yu, Y . Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066, 2024

  8. [8]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

  9. [9]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024

  10. [10]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019

  11. [11]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

  12. [12]

    Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.Biological cybernetics, 36(4):193–202, 1980

    Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.Biological cybernetics, 36(4):193–202, 1980

  13. [13]

    Visual feature extraction by a multilayered network of analog threshold elements.IEEE Transactions on Systems Science and Cybernetics, 5(4):322–333, 2007

    Kunihiko Fukushima. Visual feature extraction by a multilayered network of analog threshold elements.IEEE Transactions on Systems Science and Cybernetics, 5(4):322–333, 2007

  14. [14]

    MegaBlocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5, 2023

    Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. MegaBlocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5, 2023

  15. [15]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

  16. [16]

    Transformer Feed-Forward Layers Are Key-Value Memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories.arXiv preprint arXiv:2012.14913, 2020

  17. [17]

    Mixture of a Million Experts

    Xu Owen He. Mixture of a million experts.arXiv preprint arXiv:2407.04153, 2024

  18. [18]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016

  19. [19]

    A theory of steady-state activity in nerve-fiber networks: I

    Alston S Householder. A theory of steady-state activity in nerve-fiber networks: I. definitions and preliminary lemmas.The bulletin of mathematical biophysics, 3(2):63–69, 1941

  20. [20]

    Sd-moe: Spectral decomposition for effective expert specialization

    Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, et al. Sd-moe: Spectral decomposition for effective expert specialization.arXiv preprint arXiv:2602.12556, 2026

  21. [21]

    Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5, 2023

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Parijat Ram, et al. Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5, 2023

  22. [22]

    Adaptive mixtures of local experts

    Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. doi: 10.1162/neco.1991.3.1.79

  23. [23]

    Mixtral of Experts

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  24. [24]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  25. [25]

    Scaling laws for fine-grained mixture of experts

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Piontkowski, Piotr Piotrowski, Szymon Antoniak, et al. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024

  26. [26]

    RACE: Large-scale reading comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, 2017

  27. [27]

    ALBERT: A lite BERT for self-supervised learning of language representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020

  28. [28]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with condi- tional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2021

  29. [29]

    BASE layers: Simplifying training of large, sparse models

    Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. BASE layers: Simplifying training of large, sparse models. InInternational Conference on Machine Learning, 2021

  30. [30]

    Towards a unified view of sparse feed-forward network in pretraining large language model

    Zeyu Liu, Tim Dettmers, Xi Lin, Veselin Stoyanov, and Xian Li. Towards a unified view of sparse feed-forward network in pretraining large language model. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15038–15061, Singapore, December 2023. Association for ...

  31. [31]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2019

  32. [32]

    Olmoe: Open mixture-of-experts language models

    Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Peter Izsak, et al. OLMoE: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060, 2024

  33. [33]

    On estimating regression.Theory of Probability & Its Applications, 9(1): 141–142, 1964

    Elizbar A Nadaraya. On estimating regression.Theory of Probability & Its Applications, 9(1): 141–142, 1964

  34. [34]

    Rectified linear units improve restricted Boltzmann machines

    Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814. Omnipress, 2010

  35. [35]

    On least square estimation in softmax gating mixture of experts.arXiv preprint arXiv:2402.02952, 2024

    Huy Nguyen, Nhat Ho, and Alessandro Rinaldo. On least square estimation in softmax gating mixture of experts.arXiv preprint arXiv:2402.02952, 2024

  36. [36]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Marco Baroni, et al. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1525–1534, 2016

  37. [37]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  38. [38]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher.arXiv preprint arXiv:2112.11446, 2021

  39. [39]

    DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. arXiv preprint arXiv:2201.05596, 2022

  40. [40]

    Hash layers for large sparse models.Advances in Neural Information Processing Systems, 34, 2021

    Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. Hash layers for large sparse models.Advances in Neural Information Processing Systems, 34, 2021

  41. [41]

    Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

  42. [42]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  43. [43]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  44. [45]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2020

  45. [46]

    Dolma: an open corpus of three trillion tokens for language model pretraining research

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Ri...

  46. [47]

    RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  47. [48]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  48. [49]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  49. [50]

    Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

  50. [51]

    Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664,

    Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664, 2024

  51. [52]

    Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964

    Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964

  52. [53]

    Crowdsourcing multiple choice science questions

    Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin, editors,Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4413. URLhttps:/...

  53. [54]

    Sere: Similarity-based expert re-routing for efficient batch decoding in moe models.arXiv preprint arXiv:2602.07616, 2026

    Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, and Li Yuan. Sere: Similarity-based expert re-routing for efficient batch decoding in moe models.arXiv preprint arXiv:2602.07616, 2026

  54. [55]

    Sheared LLaMA: Accelerating language model pre-training via structured pruning

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=09iOdaeOzp

  55. [56]

    OpenMoE: An early effort on open mixture-of-experts language models.arXiv preprint arXiv:2402.01739, 2024

    Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. OpenMoE: An early effort on open mixture-of-experts language models.arXiv preprint arXiv:2402.01739, 2024

  56. [57]

    HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

  57. [58]

    Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

  58. [59]

    TinyLlama: An Open-Source Small Language Model

    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model.arXiv preprint arXiv:2401.02385, 2024

  59. [60]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022

  60. [61]

    Calibrate before use: Improving few-shot performance of language models

    Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. InInternational conference on machine learning, pages 12697–12706. PMLR, 2021

  61. [62]

    Understanding the mixture-of-experts with nadaraya-watson kernel.arXiv preprint arXiv:2509.25913, 2025

    Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, et al. Understanding the mixture-of-experts with nadaraya-watson kernel.arXiv preprint arXiv:2509.25913, 2025

  62. [63]

    Understanding transformer from the perspective of associative memory.arXiv preprint arXiv:2505.19488, 2025

    Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory.arXiv preprint arXiv:2505.19488, 2025

  63. [64]

    Mixture-of-experts with expert choice routing

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, and James Laudon. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35, 2022

  64. [65]

    LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training

    Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15913–15923, Miami, Florida, USA, N...

  65. [66]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022