Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

Bei Yu; Hui-Ling Zhen; Lancheng Zou; Mingxuan Yuan; Sinno Jialin Pan; Wulong Liu; Xianzhi Yu; Zehua Pei

arXiv: 2502.04416 · v3 · submitted 2025-02-06 · cs.LG · cs.AI

Zehua Pei, Hui-Ling Zhen, Lancheng Zou, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

Pith reviewed 2026-05-23 04:04 UTC · model grok-4.3

keywords activation patterns · mixture of experts · feed-forward networks · model restructuring · post-training · sparse activation · router construction · inference optimization

The pith

A post-training method converts dense feed-forward networks into mixture-of-experts models by analyzing neuron activation patterns on a small dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that existing dense FFN layers in LLMs can be turned into sparse MoE layers without large-scale retraining. It does this by examining how neurons activate on a calibration set of about 2k samples, separating them into always-on shared experts and conditionally used routed experts, then building a router directly from those activation statistics. If the approach holds, it would let practitioners add MoE sparsity to pre-trained models in minutes rather than requiring hundreds of billions of tokens of fine-tuning, while preserving most accuracy and delivering measurable inference speedups.

Core claim

The central claim is that an analytical post-training framework can restructure FFNs into sparse MoE architectures. It partitions neurons according to their activation patterns on a small calibration dataset, designating always-active neurons as shared experts and conditionally active ones as routed experts, and constructs the router analytically from representative neuron statistics. The result supports immediate deployment or brief fine-tuning, and the procedure extends recursively to existing MoE models.

What carries the argument

Activation pattern analysis that partitions neurons into always-active shared experts and conditionally activated routed experts, with an analytically constructed router from representative neuron statistics.
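
The partitioning step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that "always-active" means a neuron's activation magnitude exceeds a small cutoff on at least a given fraction of calibration tokens. The names `activation_rates`, `partition_neurons`, `eps`, and `shared_threshold`, and the toy activation matrix, are hypothetical; this page does not state the paper's exact thresholding rule.

```python
from typing import List, Tuple


def activation_rates(acts: List[List[float]], eps: float = 0.0) -> List[float]:
    """Fraction of calibration tokens on which each neuron's |activation| exceeds eps.

    acts is a tokens x neurons matrix of hidden activations (assumed layout).
    """
    n_tokens = len(acts)
    n_neurons = len(acts[0])
    return [
        sum(1 for row in acts if abs(row[j]) > eps) / n_tokens
        for j in range(n_neurons)
    ]


def partition_neurons(
    rates: List[float], shared_threshold: float = 0.9
) -> Tuple[List[int], List[int]]:
    """Split neuron indices into (shared, routed) by activation rate.

    shared_threshold is a hypothetical cutoff, not a value from the paper.
    """
    shared = [j for j, r in enumerate(rates) if r >= shared_threshold]
    routed = [j for j, r in enumerate(rates) if r < shared_threshold]
    return shared, routed


# Toy calibration data: 4 tokens x 5 neurons (made up for illustration).
acts = [
    [0.9, 0.0, 0.3, 0.0, 0.5],
    [0.8, 0.2, 0.0, 0.0, 0.4],
    [0.7, 0.0, 0.6, 0.0, 0.6],
    [0.9, 0.0, 0.0, 0.1, 0.7],
]
rates = activation_rates(acts)
shared, routed = partition_neurons(rates, shared_threshold=0.9)
print(shared, routed)
```

In this toy case neurons 0 and 4 fire on every token and would be treated as shared, while the rest would be left for routed experts. How routed neurons are then grouped into experts is a separate step that this sketch does not cover.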

If this is right

The restructured model can be deployed immediately without further training.
With minutes of processing and optional lightweight fine-tuning on 2k samples, the restructured model reaches up to 1.17× speedup in compute-bound scenarios.
The same partitioning procedure applies recursively to already-converted MoE models to create hierarchical sparsity.
Processing time remains on the order of minutes regardless of original model size.
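
As a rough aid for reading speedup figures like the one above, Amdahl's law relates the share of FFN compute that stays active to the end-to-end gain. The sketch below is a generic estimate under stated assumptions, not the paper's measurement methodology: `ffn_time_fraction` and `ffn_compute_kept` are hypothetical parameters, the example values are made up, and router overhead and memory-bound effects are ignored.

```python
def end_to_end_speedup(ffn_time_fraction: float, ffn_compute_kept: float) -> float:
    """Amdahl-style estimate of end-to-end speedup from FFN sparsity.

    ffn_time_fraction: share of dense inference time spent in FFN layers (0..1).
    ffn_compute_kept:  share of FFN compute still executed after restructuring (0..1].
    """
    if not 0.0 <= ffn_time_fraction <= 1.0:
        raise ValueError("ffn_time_fraction must be in [0, 1]")
    if not 0.0 < ffn_compute_kept <= 1.0:
        raise ValueError("ffn_compute_kept must be in (0, 1]")
    remaining = (1.0 - ffn_time_fraction) + ffn_time_fraction * ffn_compute_kept
    return 1.0 / remaining


# Hypothetical numbers: FFNs take half the time, half of FFN compute is kept.
example = end_to_end_speedup(0.5, 0.5)
# Keeping all FFN compute gives no speedup by construction.
dense = end_to_end_speedup(0.5, 1.0)
print(example, dense)
```

The estimate is an upper bound of sorts: any routing cost or kernel inefficiency from smaller expert matrices would reduce the realized gain.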

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be tested on non-transformer architectures that contain analogous feed-forward sublayers to check whether activation-based partitioning generalizes.
If the calibration set size requirement stays small, the technique might lower the cost of exploring MoE variants during model development cycles.
Repeated application across multiple layers might compound sparsity gains beyond what single-layer conversion achieves.

Load-bearing premise

Activation patterns observed on a small calibration dataset of roughly 2k samples are representative enough to determine a neuron partitioning and router that generalizes to the model's full data distribution.
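
One way to probe this premise is to check how stable the partition is when the calibration set is resampled. The sketch below uses Jaccard overlap between the shared-neuron sets obtained from different calibration subsets, the same kind of metric the simulated rebuttal further down proposes. The function names and the toy index sets are hypothetical, and the page reports no such measurement for the paper.

```python
from itertools import combinations
from statistics import mean
from typing import Iterable, List


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def mean_pairwise_jaccard(partitions: List[List[int]]) -> float:
    """Average Jaccard overlap over all pairs of partitions."""
    return mean(jaccard(p, q) for p, q in combinations(partitions, 2))


# Shared-neuron index sets from three hypothetical calibration subsets.
subsets = [
    [0, 4, 7, 9],
    [0, 4, 7, 8],
    [0, 4, 7, 9],
]
pair_overlap = jaccard(subsets[0], subsets[1])
stability = mean_pairwise_jaccard(subsets)
print(pair_overlap, stability)
```

A stability score near 1 would suggest the shared-versus-routed split does not depend much on which calibration samples were drawn; a low score would point to the fragility the premise warns about.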

What would settle it

Apply the restructuring to a dense model, run inference on a held-out test set drawn from the same distribution, and observe either no speedup in compute-bound regimes or a substantial accuracy drop compared with the original model.

Figures

Figures reproduced from arXiv: 2502.04416 by Bei Yu, Hui-Ling Zhen, Lancheng Zou, Mingxuan Yuan, Sinno Jialin Pan, Wulong Liu, Xianzhi Yu, Zehua Pei.

**Figure 1.** The overview of our proposed CMoE. CMoE transforms dense LLMs into sparsely activated MoE architectures through two key phases: efficient expert grouping and training-free router construction, followed by optional lightweight adaptation. As shown in … (caption truncated in source)

**Figure 2.** Trade-off between Model Performance (PPL) and Construction Time with Increasing … (caption truncated in source)

**Figure 3.** Effect of Load Balancing on expert utilization in Llama-2 7B final block (… (caption truncated in source)

Original abstract

Scaling large language models (LLMs) improves performance but significantly increases inference costs, with feed-forward networks (FFNs) consuming the majority of computational resources. While Mixture-of-Experts (MoE) architectures can reduce this cost through sparse activation, restructuring existing dense models into MoEs typically requires extensive retraining on hundreds of billions of tokens. We propose an analytical post-training framework that rapidly restructures FFNs into sparse MoE architectures using only a small calibration dataset. The method analyzes neuron activation patterns to partition neurons into always-active shared experts and conditionally activated routed experts, then constructs a router analytically from representative neuron statistics, enabling immediate deployment or optional lightweight fine-tuning. This approach applies both to dense models and recursively to existing MoE models for hierarchical sparsity. Experiments demonstrate up to $1.17\times$ speedup in compute-bound scenarios with only minutes of processing and 2k-sample fine-tuning, outperforming methods requiring orders of magnitude more resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The analytical activation-pattern method for turning FFNs into MoE without router training is the real novelty, but whether 2k calibration samples produce a stable partition and router that holds up is the open question.

The punchline is that this paper gives a post-training way to restructure dense FFNs into MoE by partitioning neurons based on activation stats from a small calibration set and then building the router analytically from those stats. No gradient-based router training is needed, which sets it apart from most conversion work that retrains heavily. It also claims the same trick can be applied recursively to existing MoE models. That analytical step is the concrete new piece and it directly targets the practical problem of getting sparse inference on already-trained models with low overhead. The reported 1.17× speedup after minutes of processing plus optional light fine-tuning on the same 2k samples is the kind of result that would matter for deployment if it holds. The approach is straightforward and the abstract is clear about the pipeline.

The main soft spot is the reliance on activation patterns from roughly 2k samples being representative enough to fix the shared-versus-routed split and to make the derived router produce the intended sparsity at inference time. A distribution shift could change which neurons look always-active, breaking the sparsity or forcing extra computation. The abstract supplies no accuracy numbers, no baseline comparisons, and no error bars, so the empirical side is still thin.

This is the sort of paper that belongs in a reading group focused on efficient inference. It deserves peer review because the analytical construction is distinct and the goal is useful, even though the current evidence leaves the generalization claim untested. I would send it out rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper claims an analytical post-training method to restructure dense FFN layers (and recursively existing MoE layers) into sparse MoE architectures. It partitions neurons into always-active shared experts and conditionally routed experts by thresholding activation statistics collected on a ~2k-sample calibration set, then derives a router analytically from those statistics. This enables immediate deployment or optional 2k-sample fine-tuning, with claimed speedups up to 1.17× in compute-bound regimes after only minutes of processing and without the hundreds of billions of tokens required by prior restructuring approaches.
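
To make "derives a router analytically" concrete, here is one possible construction, offered only as a sketch under loud assumptions. For each routed expert it picks a representative neuron (the one whose calibration activations are closest to the expert's mean activation), copies that neuron's input-weight row into the router, and routes a token to the top-k experts by dot-product score. This page does not give the paper's formula, so the selection rule, the function names, and all toy values are hypothetical.

```python
from typing import List


def representative_neuron(acts: List[List[float]], expert: List[int]) -> int:
    """Neuron in `expert` whose calibration activations are closest to the expert mean.

    acts is a tokens x neurons matrix; distance is squared Euclidean (assumed choice).
    """
    n_tokens = len(acts)
    mean_act = [sum(acts[t][j] for j in expert) / len(expert) for t in range(n_tokens)]

    def dist(j: int) -> float:
        return sum((acts[t][j] - mean_act[t]) ** 2 for t in range(n_tokens))

    return min(expert, key=dist)


def build_router(
    w_in: List[List[float]], acts: List[List[float]], experts: List[List[int]]
) -> List[List[float]]:
    """One router row per expert, copied from its representative neuron's input weights."""
    return [list(w_in[representative_neuron(acts, e)]) for e in experts]


def route(router: List[List[float]], x: List[float], k: int) -> List[int]:
    """Indices of the k experts with the highest router score for input x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    order = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(order[:k])


# Toy layer: 6 neurons with 2 input features, grouped into 2 routed experts.
w_in = [[1.0, 0.0], [0.8, 0.1], [0.7, 0.0], [0.0, 1.0], [0.1, 0.7], [0.0, 0.6]]
acts = [
    [1.0, 0.6, 0.5, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.9, 0.5, 0.4],
]
experts = [[0, 1, 2], [3, 4, 5]]
router = build_router(w_in, acts, experts)
print(router, route(router, [1.0, 0.0], k=1), route(router, [0.0, 1.0], k=1))
```

The point of the sketch is that no gradient step is involved: the router weights are read off existing FFN weights and calibration statistics. Whether the paper uses this particular representative rule is not something this page confirms.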

Significance. If the activation-pattern partitioning and analytical router generalize beyond the calibration set while preserving accuracy, the approach would offer a low-cost route to sparse inference for existing dense models, substantially lowering the barrier to MoE deployment compared with full retraining. The recursive extension to existing MoEs and the parameter-free derivation of the router from calibration statistics are notable strengths if empirically validated.

major comments (3)

[Abstract and §3] Abstract and §3 (method description): the central claim of immediate deployment or 2k-sample fine-tuning delivering 1.17× speedup with negligible accuracy loss rests on the assumption that activation frequencies observed on the ~2k calibration samples are representative of the full data distribution; no invariance argument, distribution-shift experiment, or stability analysis of the resulting partition is supplied, which directly undermines the sparsity guarantee at inference time.
[Abstract] Abstract: the reported empirical speedups are stated without accompanying quantitative results on accuracy retention, baseline comparisons (e.g., against dense model or existing MoE conversion methods), error bars, or the exact thresholding/partitioning procedure, making it impossible to assess whether the claimed performance trade-off holds.
[§4] §4 (experiments): the absence of any reported router-fidelity metric (i.e., how closely the analytically constructed router reproduces the intended expert allocation on held-out data) leaves the second load-bearing condition of the skeptic's note unaddressed.

minor comments (2)

[§3] Notation for shared vs. routed experts and the precise definition of the analytical router construction should be formalized with equations rather than prose descriptions.
[§4] The recursive application to existing MoE models is mentioned but lacks a dedicated experiment or ablation showing hierarchical sparsity gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify several gaps in empirical validation and reporting that we will address through targeted revisions to strengthen the manuscript.

Referee: [Abstract and §3] Abstract and §3 (method description): the central claim of immediate deployment or 2k-sample fine-tuning delivering 1.17× speedup with negligible accuracy loss rests on the assumption that activation frequencies observed on the ~2k calibration samples are representative of the full data distribution; no invariance argument, distribution-shift experiment, or stability analysis of the resulting partition is supplied, which directly undermines the sparsity guarantee at inference time.

Authors: We agree that the current manuscript lacks an explicit analysis of partition stability or distribution-shift robustness. In the revised version we will add a new subsection in §3 (and supporting results in §4) that reports activation-pattern consistency on held-out data drawn from multiple domains, together with a simple stability metric (e.g., Jaccard overlap of expert assignments across calibration subsets). This will directly support the generalization claim. revision: yes
Referee: [Abstract] Abstract: the reported empirical speedups are stated without accompanying quantitative results on accuracy retention, baseline comparisons (e.g., against dense model or existing MoE conversion methods), error bars, or the exact thresholding/partitioning procedure, making it impossible to assess whether the claimed performance trade-off holds.

Authors: The abstract currently summarizes only the headline speedup. We will expand it to include the key quantitative figures already present in §4 (accuracy retention relative to the dense baseline, comparison against prior restructuring methods, and standard-error bars across runs) while keeping the abstract concise. The exact thresholding procedure will also be stated explicitly in the abstract and §3. revision: yes
Referee: [§4] §4 (experiments): the absence of any reported router-fidelity metric (i.e., how closely the analytically constructed router reproduces the intended expert allocation on held-out data) leaves the second load-bearing condition of the skeptic's note unaddressed.

Authors: We will add a router-fidelity evaluation in §4 that measures, on held-out calibration and test sets, both (a) the agreement between the analytical router’s expert selection and the neuron-activation ground truth and (b) the resulting load balance. These metrics will be reported alongside the existing speedup and accuracy numbers. revision: yes
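
The two quantities named in the last response can be sketched as follows. This is an illustration of what such metrics might look like, not the evaluation the authors ran: fidelity is taken here as the mean per-token fraction of "ground-truth" experts that the router also selected, and load balance as the coefficient of variation of per-expert token counts. The function names, the definition of ground truth, and the toy selections are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List


def router_fidelity(selected: List[List[int]], truth: List[List[int]]) -> float:
    """Mean fraction of ground-truth experts that the router also picked, per token."""
    return mean(len(set(s) & set(t)) / len(t) for s, t in zip(selected, truth))


def load_cv(selected: List[List[int]], n_experts: int) -> float:
    """Coefficient of variation of expert loads; 0.0 means perfectly balanced."""
    counts = Counter(e for s in selected for e in s)
    loads = [counts.get(e, 0) for e in range(n_experts)]
    return pstdev(loads) / mean(loads)


# Hypothetical held-out tokens: router picks vs. activation-derived targets (top-2).
selected = [[0, 1], [1, 2], [0, 3], [2, 3]]
truth = [[0, 1], [1, 3], [0, 3], [2, 3]]
fidelity = router_fidelity(selected, truth)
balance = load_cv(selected, n_experts=4)
print(fidelity, balance)
```

Reporting both numbers together matters because a router can match the activation-derived targets well while still overloading a few experts, or the reverse.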

Circularity Check

0 steps flagged

No circularity; derivation uses external calibration data for analytical restructuring

The paper's core steps—collecting activation statistics on a 2k-sample calibration set, partitioning neurons into shared vs. routed experts by thresholding those statistics, and constructing the router analytically from the same observed frequencies—operate on external data rather than reducing any claimed prediction or result to a self-definition, fitted input renamed as prediction, or self-citation chain. No load-bearing uniqueness theorems, smuggled ansatzes, or renamings of known results appear in the derivation; the method remains self-contained against the calibration inputs without tautological closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; assessment limited to surface description.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
cs.DC · 2026-05 · unverdicted · novelty 6.0

GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.

[1] [1]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Visual instruction tuning.Advances in neural information processing systems, 36, 2024

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024

work page 2024

[4] [4]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with condi- tional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006

[6] [6]

Glam: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022

work page 2022

[7] [7]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

work page 2022

[8] [8]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Deja vu: Contextual sparsity for efficient llms at inference time

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivas- tava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR, 2023

work page 2023

[10] [10]

Moefication: Transformer feed-forward layers are mixtures of experts

Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Moefication: Transformer feed-forward layers are mixtures of experts. arXiv preprint arXiv:2110.01786, 2021

work page arXiv 2021

[11] [11]

Fusegpt: Learnable layers fusion of generative pre-trained transformers.arXiv preprint arXiv:2411.14507, 2024

Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, and Bei Yu. Fusegpt: Learnable layers fusion of generative pre-trained transformers.arXiv preprint arXiv:2411.14507, 2024

work page arXiv 2024

[12] [12]

Llama-moe: Building mixture-of-experts from llama with continual pre-training

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. Llama-moe: Building mixture-of-experts from llama with continual pre-training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15913–15923, 2024

work page 2024

[13] [13]

Llama-moe v2: Exploring sparsity of llama from perspective of mixture-of-experts with post-training

Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, and Yu Cheng. Llama-moe v2: Exploring sparsity of llama from perspective of mixture-of-experts with post-training. arXiv preprint arXiv:2411.15708, 2024

work page arXiv 2024

[14] [14]

Learn to be efficient: Build structured sparsity in large language models. arXiv preprint arXiv:2402.06126, 2024

Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z Morley Mao, Beidi Chen, Fan Lai, and Atul Prakash. Learn to be efficient: Build structured sparsity in large language models. arXiv preprint arXiv:2402.06126, 2024

work page arXiv 2024

[15] [15]

A shortest augmenting path algorithm for dense and sparse linear assignment problems

Roy Jonker and Ton Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. In DGOR/NSOR: Papers of the 16th Annual Meeting of DGOR in Cooperation with NSOR/Vorträge der 16. Jahrestagung der DGOR zusammen mit der NSOR, pages 622–622. Springer, 1988

work page 1988

[16] [16]

Moebert: from bert to mixture-of-experts via importance-guided adaptation

Simiao Zuo, Qingru Zhang, Chen Liang, Pengcheng He, Tuo Zhao, and Weizhu Chen. Moebert: from bert to mixture-of-experts via importance-guided adaptation. arXiv preprint arXiv:2204.07675, 2022

work page arXiv 2022

[17] [17]

Xmoe: Sparse models with fine-grained and adaptive expert selection. arXiv preprint arXiv:2403.18926, 2024

Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, and Zenglin Xu. Xmoe: Sparse models with fine-grained and adaptive expert selection. arXiv preprint arXiv:2403.18926, 2024

work page arXiv 2024

[18] [18]

Sparse upcycling: Training mixture-of-experts from dense checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055, 2022

work page arXiv 2022

[19] [19]

Parameter-efficient sparsity crafting from dense to mixture-of-experts for instruction tuning on general tasks. arXiv preprint arXiv:2401.02731, 2024

Haoyuan Wu, Haisheng Zheng, Zhuolun He, and Bei Yu. Parameter-efficient sparsity crafting from dense to mixture-of-experts for instruction tuning on general tasks. arXiv preprint arXiv:2401.02731, 2024

work page arXiv 2024

[20] [20]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024

work page 2024

[21] [21]

Quantization via distillation and contrastive learning

Zehua Pei, Xufeng Yao, Wenqian Zhao, and Bei Yu. Quantization via distillation and contrastive learning. IEEE Transactions on Neural Networks and Learning Systems, 2023

work page 2023

[22] [22]

Bie: Bi-exponent block floating-point for large language models quantization

Lancheng Zou, Wenqian Zhao, Shuo Yin, Chen Bai, Qi Sun, and Bei Yu. Bie: Bi-exponent block floating-point for large language models quantization. In Forty-first International Conference on Machine Learning, 2024

work page 2024

[23] [23]

Qserve: W4A8KV4 quantization and system co-design for efficient LLM serving. CoRR, abs/2405.04532, 2024

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532, 2024

work page arXiv 2024

[24] [24]

Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800

Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800, 2024

work page arXiv 2024

[25] [25]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853, 2024

work page arXiv 2024

[26] [26]

Slicegpt: Compress large language models by deleting rows and columns

Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024

work page arXiv 2024

[27] [27]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

T Wolf. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910

[28] [28]

Pytorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

work page 2019

[29] [29]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[30] [30]

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023

work page 2023

[31] [31]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[32] [32]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[33] [33]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

work page 2020

[34] [34]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[35] [35]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

work page 2020

[36] [36]

Crowdsourcing Multiple Choice Science Questions

Johannes Welbl, Nelson F Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[37] [37]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

work page 2021

[38] [38]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[39] [39]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

A Algorithmic Analysis

[Figure (x-axis: Hidden State Value). Min: -0.134766, Max: 0.202148, 25th percentile: -0.005768, 75th percentile: 0.006073]

work page internal anchor Pith review Pith/arXiv arXiv 1905