DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

Chaojun Xiao; Chenyang Song; Weilin Zhao; Xu Han; Yingfa Chen; Zhiyuan Liu

arxiv: 2605.10933 · v3 · pith:GAIYONA2new · submitted 2026-05-11 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-21 07:54 UTC · model grok-4.3

keywords: mixture of experts · sparse models · end-side devices · transformer efficiency · routing mechanism · activation function · model deployment

The pith

DECO sparse MoE matches dense Transformer performance while activating only 20% of experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents DECO, a sparse Mixture-of-Experts architecture that performs on par with dense models given the same total parameters and training data. It uses differentiable ReLU-based routing, augmented with a learnable scaling factor per expert that adaptively balances the contributions of routed and shared experts. A new activation function, NormSiLU, normalizes inputs before SiLU, which stabilizes the routed-expert activation ratio and raises intrinsic sparsity. The authors also report an empirical advantage for non-gated MLP experts under ReLU-based routing. The design targets end-side deployment: it cuts storage and memory demands, and a custom kernel delivers a measured speedup over dense inference.

Core claim

DECO achieves dense-comparable performance in a sparse MoE setup by activating only 20% of routed experts through its ReLU-based routing with learnable expert-wise scaling that balances routed and shared experts, along with NormSiLU for more stable sparsity trends, and shows advantages in non-gated MLP experts.

What carries the argument

Differentiable ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances contributions of routed and shared experts.
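
A minimal Python sketch of what such a router could look like, assuming the expert-wise scale simply multiplies the ReLU gate of each routed expert (where the scale is actually applied is not stated on this page). The names `relu_route`, `moe_output`, and `expert_scales` are illustrative and are not taken from the released code.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def relu_route(hidden, router_weights, expert_scales):
    """Return one non-negative gate per routed expert.

    hidden:         list[float], token representation of length d
    router_weights: list[list[float]], one row of length d per routed expert
    expert_scales:  list[float], learnable per-expert scale (assumed form)
    """
    gates = []
    for row, scale in zip(router_weights, expert_scales):
        logit = sum(w * h for w, h in zip(row, hidden))
        # ReLU zeroes out negative logits, so the number of active
        # experts can differ from token to token.
        gates.append(scale * relu(logit))
    return gates


def moe_output(hidden, gates, routed_experts, shared_expert):
    """The shared expert always runs; a routed expert runs only if its gate > 0."""
    out = shared_expert(hidden)
    for gate, expert in zip(gates, routed_experts):
        if gate > 0.0:  # experts with a zero gate are skipped entirely
            y = expert(hidden)
            out = [o + gate * v for o, v in zip(out, y)]
    return out


# Toy usage with made-up numbers: 3 routed experts, 2-dimensional token.
hidden = [1.0, -2.0]
router_weights = [[0.5, 0.1], [-1.0, 0.0], [0.0, -0.5]]
expert_scales = [2.0, 1.0, 0.5]
gates = relu_route(hidden, router_weights, expert_scales)
active_ratio = sum(g > 0.0 for g in gates) / len(gates)
print(gates, active_ratio)
```

Because the gate is a continuous function of the router logits, gradients flow through every active expert, which is the sense in which ReLU routing is differentiable; a top-k router instead makes a discrete selection.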

If this is right

Models can run inference with much lower computational cost and memory access on resource-limited devices.
Training remains efficient as total parameters and tokens match dense baselines.
Custom acceleration kernels enable significant speedups: the paper reports 2.93× over dense inference on Jetson AGX Orin.
MoE designs can be simplified without gating mechanisms in experts.
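
The cost argument can be made concrete with back-of-the-envelope arithmetic. The sketch below counts the feed-forward weights touched per token when a shared expert always runs and only a fraction of routed experts fire. All dimensions are hypothetical, chosen so the numbers come out round; they are not DECO's configuration, and biases are ignored.

```python
def mlp_params(d_model, d_inter, gated):
    """Weight count of one MLP expert.

    A gated MLP has three matrices (gate, up, down);
    a non-gated MLP has two (up, down).
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_inter


def active_ffn_fraction(d_model, d_shared, d_expert, n_routed, ratio, gated=False):
    """Fraction of feed-forward weights read per token, on average."""
    shared = mlp_params(d_model, d_shared, gated)
    routed = n_routed * mlp_params(d_model, d_expert, gated)
    active = shared + ratio * routed  # the shared expert is always active
    return active / (shared + routed)


# Hypothetical layer: 60 small routed experts plus one shared expert.
fraction = active_ffn_fraction(
    d_model=512, d_shared=128, d_expert=32, n_routed=60, ratio=0.2
)
print(f"active FFN fraction: {fraction:.2f}")
```

In this toy layer the active fraction is higher than the 20% routed ratio because the shared expert always contributes. At equal intermediate width, a non-gated expert also stores one matrix fewer than a gated one.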

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This routing technique might apply to other types of sparse neural networks beyond language models.
Reduced need for complex post-training fixes could make MoE more accessible for developers.
Energy savings on edge devices could enable more advanced AI applications in mobile and IoT settings.

Load-bearing premise

The ReLU-based routing and scaling factors will consistently maintain stable activation ratios and balanced expert contributions across different datasets without extra tuning.

What would settle it

Training a DECO model on a new dataset or scaling to larger sizes and observing that performance falls short of dense models or requires significant hyperparameter changes to stabilize.

Figures

Figures reproduced from arXiv: 2605.10933 by Chaojun Xiao, Chenyang Song, Weilin Zhao, Xu Han, Yingfa Chen, Zhiyuan Liu.

**Figure 1.** The “ideal triangle” of end-side MoE. Beyond the high performance and reduced computational cost of sparse MoE, the model should maintain a minimal storage footprint, achieving high performance within dense-comparable total parameter budgets.

**Figure 2.** The overall architecture of DECO. For router design, we adopt ReLU-based routing enhanced by learnable expert-wise router scaling. For expert design, we propose NormSiLU as a better routed-expert activation function and employ non-gated MLP experts. For precise sparsity control, we employ adaptive sparsity regularization.

**Figure 3.** The evaluation results of DECO versus baseline settings. “PPL” and “Task” indicate the C4 validation perplexity and the average accuracy (%) on downstream benchmarks, respectively. DeepSeek-V3 uses gated MLP experts, and ReMoE uses non-gated ones. This is due to their better performance than the opposite settings; see Section 4.4 for detailed discussions.

**Figure 4.** The distribution of routed-expert output norms in the first MoE layer of DECO (Medium) on the C4 validation set, which shows clear expert-wise heterogeneity. To demonstrate the effect of DECO’s router scaling design, we experiment on two ablation settings: “Fixed” adopts a constant scaling factor for all routed experts, and “Scalar” involves a single learnable scalar scaling factor shared by experts.

**Figure 5.** The trend of the regularization coefficient of DECO (Small) and ablation settings each removing one step of NormSiLU. The settings “SiLU” and “w/o RMS” show significantly higher coefficients, which potentially harm performance. [Plot: x-axis Training Step; y-axis Absolute SiLU Output Magnitudes; series SiLU, w/o RMS, w/o Mean, NormSiLU.]

**Figure 6.** The trend of the regularization coefficient of DECO (Small) and ablation settings without different steps of NormSiLU. The baseline “SiLU” and “w/o RMS” settings show significantly higher coefficients, which potentially harm performance. [Plot: x-axis Training Step; y-axis Absolute SiLU Output Magnitudes; series SiLU, w/o RMS, w/o Mean, NormSiLU.]

**Figure 7.** The trend of routed-expert activation ratio of …

**Figure 8.** The trend of routed-expert activation ratio of DECO (Small) using different expert gating policies. DeepSeek-V3 is a well-performing MoE architecture using a fixed per-token activation ratio, while ReMoE and DECO use ReLU-based routing to implement a flexible activation ratio. For each architecture, we compare non-gated MLP experts (NG) against gated MLP experts (GA)…

**Figure 10.** The impact of the expert granularity ($g = 4d_h/d_e$) on performance of DECO.

**Figure 9.** The impact of the routed-expert activation ratio on the performance of DECO (Small and Medium). [Plot: x-axis Intermediate Dimension of Shared Expert; y-axis C4 Validation PPL; series DECO (Small), Dense (Small), DECO (Medium), Dense (Medium).]

**Figure 9.** The impact of shared expert sizes on perfor…
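
The Figure 2 caption mentions adaptive sparsity regularization, and Figures 5 and 6 track its coefficient over training. This page does not give the update rule, so the Python sketch below uses a simple multiplicative controller as a stand-in: the coefficient grows while too many routed experts are active and shrinks when too few are. The function name, the factor of 1.2, and the observed ratios are all assumptions made for illustration.

```python
def update_coefficient(coef, observed_ratio, target_ratio=0.2, factor=1.2):
    """One step of a multiplicative controller for the sparsity penalty weight.

    This is an assumed rule, not the paper's: raise the penalty when the
    measured activation ratio exceeds the target, lower it when below.
    """
    if observed_ratio > target_ratio:
        return coef * factor
    if observed_ratio < target_ratio:
        return coef / factor
    return coef


# Made-up activation ratios measured at successive training steps.
coef = 1e-3
for observed in [0.60, 0.45, 0.30, 0.22, 0.18]:
    coef = update_coefficient(coef, observed)
    print(f"observed={observed:.2f} -> coefficient={coef:.6f}")
```

Under a controller of this kind, a coefficient that keeps climbing signals that the activation function resists sparsification, which is one way to read the higher coefficients reported for the “SiLU” and “w/o RMS” settings.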

Original abstract

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of routed experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 2.93$\times$ speedup on Jetson AGX Orin compared with dense inference. Code and checkpoints are available at https://github.com/thunlp/DECO.
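
As a rough illustration of "normalizes inputs prior to SiLU", here is a minimal Python sketch. It assumes NormSiLU subtracts the mean and divides by the root mean square before applying SiLU, which is inferred only from the ablation labels "w/o Mean" and "w/o RMS" in the figure captions. The exact definition (the paper's Eq. (7)–(9)), including the normalization axis and any learnable gain, is not reproduced on this page, so treat `norm_silu` as hypothetical.

```python
import math


def silu(x):
    """SiLU (sigmoid-weighted linear unit): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def norm_silu(xs, eps=1e-6):
    """Assumed form of NormSiLU: center, RMS-normalize, then apply SiLU."""
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]                                    # "Mean" step
    rms = math.sqrt(sum(c * c for c in centered) / len(centered) + eps)  # "RMS" step
    return [silu(c / rms) for c in centered]


# Outputs barely change when the inputs are shifted or rescaled.
print(norm_silu([1.0, 2.0, 3.0, 4.0]))
print(norm_silu([10.0, 20.0, 30.0, 40.0]))
```

Under this assumed form the output depends only on the relative pattern of the pre-activations, not on their overall scale or offset, which is consistent with the abstract's claim of a more stable activation-ratio trend.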

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DECO gets a sparse MoE to match dense performance at 20% activation on edge hardware with a released kernel, but the routing stability claim needs checking against the full ablations.

DECO is worth a look if you care about running larger models on phones or embedded boards without the usual memory bloat. The headline result is that they hit dense Transformer accuracy under the same total parameter count and training tokens while routing only 20% of experts, plus a custom kernel that delivers 2.93x speedup on Jetson AGX Orin. They also ship code and checkpoints, which makes the numbers easier to verify directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces DECO, a sparse Mixture-of-Experts architecture for end-side devices that uses differentiable ReLU-based routing augmented by learnable expert-wise scaling factors, together with the NormSiLU activation function, to achieve performance comparable to a dense Transformer while activating only 20% of the routed experts under identical total parameter budgets and training token counts. It further reports that non-gated MLP experts work well with this router, outperforms prior MoE baselines, and delivers a 2.93× inference speedup on Jetson AGX Orin via a custom kernel. Code and checkpoints are released.

Significance. If the central empirical claims hold under controlled conditions, the result would be significant for practical deployment of large models on memory- and storage-constrained edge hardware, because it simultaneously reduces active parameters, maintains dense-level accuracy, and provides measured acceleration. The public release of code and checkpoints is a clear strength that supports reproducibility and follow-up work.

major comments (2)

[§4.1 and Table 2] §4.1 and Table 2: the headline result that DECO matches dense performance while activating only ~20% of routed experts rests on the ReLU router plus learnable scaling producing both the target sparsity and balanced expert contributions without dataset-specific retuning. The manuscript does not report the variance of the activation ratio across random seeds, model scales, or task distributions, nor does it show an ablation removing the learnable scaling factors; without these controls the 20% figure could be an artifact of the particular training run rather than a robust architectural property.
[§3.3, Eq. (7)–(9)] §3.3, Eq. (7)–(9): the NormSiLU definition and the claim that it produces “a more stable trend of routed-expert activation ratio” are presented without a quantitative comparison (e.g., standard deviation of activation ratio over training steps) against a plain SiLU baseline under the same routing setup. This stability is load-bearing for the reproducibility of the 20% activation result.
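
The quantitative check requested in these two comments could be computed along the following lines. This is a sketch with hypothetical helper names and made-up numbers. It assumes per-token gate vectors (or per-step activation ratios) have been logged during training, and it uses the population standard deviation from Python's `statistics` module.

```python
from statistics import mean, pstdev


def activation_ratio(gates_per_token):
    """Share of (token, expert) pairs whose gate is strictly positive."""
    total = sum(len(gates) for gates in gates_per_token)
    active = sum(1 for gates in gates_per_token for g in gates if g > 0.0)
    return active / total


def ratio_stability(ratios_per_step, target):
    """Summary statistics of the activation ratio over logged training steps."""
    return {
        "mean": mean(ratios_per_step),
        "std": pstdev(ratios_per_step),
        "max_abs_dev": max(abs(r - target) for r in ratios_per_step),
    }


# Made-up logs for two hypothetical runs with a 20% target.
run_a = [0.21, 0.20, 0.19, 0.20, 0.20]
run_b = [0.35, 0.12, 0.28, 0.15, 0.10]
print(ratio_stability(run_a, target=0.2))
print(ratio_stability(run_b, target=0.2))
```

Reporting the same statistics across random seeds, rather than across steps, would address the variance question raised in the first comment.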

minor comments (2)

[Abstract and §4.2] The abstract and §4.2 report a 2.93× speedup but do not state the batch size, sequence length, or precision used for the Jetson AGX Orin measurement; adding these details would improve clarity.
[Figure 3] Figure 3 caption should explicitly note whether the plotted activation ratios are averaged over the final 10% of training steps or measured at convergence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of robustness and reproducibility that we address below. We commit to incorporating additional experiments and quantitative analyses in the revised version to strengthen the empirical claims.

Referee: [§4.1 and Table 2] §4.1 and Table 2: the headline result that DECO matches dense performance while activating only ~20% of routed experts rests on the ReLU router plus learnable scaling producing both the target sparsity and balanced expert contributions without dataset-specific retuning. The manuscript does not report the variance of the activation ratio across random seeds, model scales, or task distributions, nor does it show an ablation removing the learnable scaling factors; without these controls the 20% figure could be an artifact of the particular training run rather than a robust architectural property.

Authors: We agree that additional controls would strengthen the robustness claim. In the revised manuscript we will report the mean and standard deviation of the activation ratio across at least three independent random seeds for the primary experiments. We will also add an ablation study that removes the learnable expert-wise scaling factors while keeping all other components fixed, showing that the activation ratio deviates from the target 20% and becomes less balanced. Regarding model scales and task distributions, the existing experiments already cover multiple model sizes and diverse tasks without per-task retuning; we will explicitly tabulate the activation ratios across these settings to demonstrate consistency. revision: yes
Referee: [§3.3, Eq. (7)–(9)] §3.3, Eq. (7)–(9): the NormSiLU definition and the claim that it produces “a more stable trend of routed-expert activation ratio” are presented without a quantitative comparison (e.g., standard deviation of activation ratio over training steps) against a plain SiLU baseline under the same routing setup. This stability is load-bearing for the reproducibility of the 20% activation result.

Authors: We concur that a direct quantitative comparison is necessary to substantiate the stability claim. In the revised manuscript we will include a new figure or table that plots the routed-expert activation ratio over training steps for both NormSiLU and a plain SiLU baseline under identical routing and training configurations. We will report the standard deviation of the activation ratio across steps for each, confirming the improved stability of NormSiLU and its contribution to maintaining the target sparsity level. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by experiments

The paper introduces DECO as a new sparse MoE design using differentiable ReLU-based routing with learnable expert-wise scaling and the NormSiLU activation function. It reports experimental outcomes showing that activating only 20% of routed experts matches dense Transformer performance under matched total parameter and token budgets. No equations, derivations, or first-principles predictions are presented in the abstract or described claims; the central results rest on reported empirical measurements rather than any quantity that reduces to a fitted parameter or self-defined input by construction. No load-bearing self-citations or uniqueness theorems are invoked to force the architecture. The derivation chain is therefore self-contained as an empirical proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The design relies on standard neural network training assumptions plus the empirical claim that the introduced routing and activation choices produce stable sparsity. No axioms or invented entities are named in the abstract; the only free parameters it introduces are the learnable expert-wise scaling factors.

free parameters (1)

learnable expert-wise scaling factors
Additional per-expert parameters introduced to balance routed and shared expert contributions.

pith-pipeline@v0.9.0 · 5769 in / 1215 out tokens · 48355 ms · 2026-05-21T07:54:05.184424+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling... NormSiLU... adaptive sparsity regularization... router entropy loss

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

activating only 20% of routed experts, matches dense performance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2016
[77]

International Conference on Learning Representations , year=

Exploring Sparsity in Recurrent Neural Networks , author=. International Conference on Learning Representations , year=

work page
[78]

The Power of Sparsity in Convolutional Neural Networks

The power of sparsity in convolutional neural networks , author=. arXiv preprint arXiv:1702.06257 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[79]

An Image is Worth 16x16 Words:

Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others , booktitle=. An Image is Worth 16x16 Words:. 2020 , url=

work page 2020
[80]

2023 , url=

Alizadeh, Keivan and Mirzadeh, Iman and Belenko, Dmitry and Khatamifard, Karen and Cho, Minsik and Del Mundo, Carlo C and Rastegari, Mohammad and Farajtabar, Mehrdad , journal=. 2023 , url=

work page 2023
