Recognition: unknown
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
Pith reviewed 2026-05-10 02:20 UTC · model grok-4.3
The pith
GRASPrune prunes 50% of LLaMA-2-7B parameters while preserving competitive accuracy using only unlabeled calibration data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRASPrune jointly prunes FFN channels and KV head groups under a single global budget. It learns lightweight gate scores with a projected straight-through estimator to enforce a hard mask that satisfies the budget at every step while keeping the backbone weights frozen. After the mask is fixed, scaling factors are calibrated on the retained units to mitigate scale mismatch and folded into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference.
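The mechanism is described here only in prose; the following is a minimal PyTorch sketch, under our own assumptions, of how a globally budgeted hard mask driven by learnable gate scores and a projected straight-through estimator could look. The class and variable names, the sigmoid surrogate, and the shapes are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BudgetedGates(nn.Module):
    """Illustrative sketch: one gate score per prunable unit (FFN channel or
    KV head group), with a hard 0/1 mask that satisfies a global keep budget
    at every step, trained via a straight-through estimator."""

    def __init__(self, num_units: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_units))    # learnable gate scores
        self.k = max(1, int(round(keep_ratio * num_units)))   # global budget in units

    def forward(self) -> torch.Tensor:
        # Projection onto the budget: keep exactly the top-k scored units.
        topk = torch.topk(self.scores, self.k).indices
        hard = torch.zeros_like(self.scores)
        hard[topk] = 1.0
        # Straight-through estimator: the forward value is the hard mask,
        # gradients flow to the scores through a soft sigmoid surrogate.
        soft = torch.sigmoid(self.scores)
        return hard + soft - soft.detach()

# Usage sketch: mask an FFN hidden activation while the backbone stays frozen.
gates = BudgetedGates(num_units=11008, keep_ratio=0.5)   # 11008 = LLaMA-2-7B FFN width
hidden = torch.randn(1, 11008)                           # activation from a frozen layer
masked = hidden * gates()                                # exactly 50% of channels pass
```

Keeping the top-k projection inside the forward pass is what makes the budget hold at every training step, while the detach trick lets gradients reach the gate scores despite the hard 0/1 mask.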
What carries the argument
Lightweight gate scores trained with a projected straight-through estimator that enforces a global pruning budget on FFN channels and KV head groups.
Load-bearing premise
Lightweight gate scores learned with the projected straight-through estimator accurately identify removable FFN channels and KV head groups without requiring full-model fine-tuning or task-specific data.
What would settle it
The claim would be undermined if the pruned LLaMA-2-7B model showed perplexity significantly above 12.18 on WikiText-2, or average zero-shot accuracy on the five benchmarks substantially below the reported competitive level.
Original abstract
Large language models (LLMs) are expensive to serve because model parameters, attention computation, and KV caches impose substantial memory and latency costs. We present GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Instead of learning importance scores without constraints and applying the budget only after training, GRASPrune learns lightweight gate scores with a projected straight-through estimator that enforces a hard mask satisfying the budget at every step while keeping the backbone weights frozen. After the mask is fixed, we calibrate scaling factors on the retained units to mitigate scale mismatch caused by pruning, and fold these factors into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference. On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five benchmarks, using four epochs on 512 unlabeled calibration sequences on a single NVIDIA A100 80GB GPU without any full model fine-tuning.
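The calibrate-and-fold step is likewise described only at a high level. Below is a minimal sketch, assuming the scale factor for each retained channel is fit by least squares against the unpruned layer's activations on the calibration data and then folded into the down-projection weight; the fitting objective and all names here are our assumptions, not the paper's procedure.

```python
import torch

def calibrate_and_fold(W_down: torch.Tensor, keep: torch.Tensor,
                       acts_full: torch.Tensor, acts_kept: torch.Tensor) -> torch.Tensor:
    """Sketch of per-channel scale calibration after pruning (assumed form).

    W_down    : (d_model, d_ffn) frozen down-projection weight
    keep      : (k,) indices of retained FFN channels
    acts_full : (n_tokens, d_ffn) pre-pruning activations on calibration data
    acts_kept : (n_tokens, k)     post-pruning activations of the retained channels
    Returns the pruned weight slice with the scale factors folded in.
    """
    # Per-channel least-squares scale: s_j = <a_full_j, a_kept_j> / <a_kept_j, a_kept_j>
    num = (acts_full[:, keep] * acts_kept).sum(dim=0)
    den = (acts_kept * acts_kept).sum(dim=0).clamp_min(1e-8)
    scale = num / den                              # (k,) multiplicative correction

    # Fold the scales into the pruned projection: no extra parameters at inference.
    return W_down[:, keep] * scale.unsqueeze(0)    # (d_model, k) dense checkpoint slice

# Usage sketch on random data (real use would feed calibration-set activations).
W = torch.randn(4096, 11008)
keep = torch.topk(torch.randn(11008), 5504).indices
a_full = torch.randn(256, 11008)
W_pruned = calibrate_and_fold(W, keep, a_full, a_full[:, keep] * 0.9)
```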
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GRASPrune, a post-pretraining structured pruning framework for LLMs that jointly prunes FFN channels and KV head groups under a single global budget. It learns lightweight gate scores via a projected straight-through estimator to enforce a hard mask at every training step while keeping backbone weights frozen, then calibrates per-unit scaling factors on 512 unlabeled sequences to produce a smaller dense checkpoint. On LLaMA-2-7B with 50% parameter removal, it reports 12.18 perplexity on WikiText-2 and competitive average zero-shot accuracy across five benchmarks, using four epochs on a single A100 GPU without full-model fine-tuning.
Significance. If the performance numbers prove robust, the method offers a low-overhead route to structured pruning of LLMs that avoids task-specific data and full fine-tuning, which could aid memory-constrained deployment. The global-budget constraint enforced during gate learning is a technically interesting design choice compared to post-hoc thresholding approaches.
major comments (2)
- [Abstract] The claim that calibrating scaling factors on 512 sequences suffices to mitigate scale mismatch after 50% FFN/KV pruning is load-bearing for the reported 12.18 WikiText-2 perplexity and zero-shot scores, yet the manuscript supplies no ablation on calibration data volume, no per-layer activation scale or covariance statistics before versus after pruning, and no comparison against simple renormalization by retained-channel count. Pruning necessarily shifts layer-wise activation distributions, so residual scale mismatch could inflate the reported numbers.
- [Experimental results] The perplexity and accuracy figures summarized in the abstract are reported without error bars, standard deviations across runs, or detailed comparisons against other structured pruning methods. This makes it impossible to determine whether the combination of global gating and calibration delivers a statistically meaningful improvement, or whether the results are sensitive to the particular 512-sequence calibration set.
minor comments (1)
- [Abstract] The abstract introduces the projected straight-through estimator without a short parenthetical explanation or citation; a one-sentence gloss would improve accessibility for readers outside the pruning literature.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on GRASPrune. We address each major comment below and describe the changes we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] The claim that calibrating scaling factors on 512 sequences suffices to mitigate scale mismatch after 50% FFN/KV pruning is load-bearing for the reported 12.18 WikiText-2 perplexity and zero-shot scores, yet the manuscript supplies no ablation on calibration data volume, no per-layer activation scale or covariance statistics before versus after pruning, and no comparison against simple renormalization by retained-channel count. Pruning necessarily shifts layer-wise activation distributions, so residual scale mismatch could inflate the reported numbers.
Authors: We agree that further analysis of the calibration step is warranted. In the revised manuscript we will add an ablation varying the calibration set size (128, 256, 512, and 1024 sequences) and report the resulting WikiText-2 perplexity and zero-shot accuracies to show that performance stabilizes near 512 sequences. We will also include a table of per-layer activation means and variances before and after pruning to quantify the scale shift, and we will add a simple renormalization baseline (scaling factors set by the ratio of retained channels) to demonstrate that the learned calibration outperforms this heuristic. revision: yes
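For concreteness, the renormalization baseline mentioned in this response could be as simple as the sketch below; the exact formula (rescaling by the inverse of the retained fraction) is an assumption on our part, since the response only says the factors are set by the ratio of retained channels.

```python
import torch

def renorm_baseline(W_down: torch.Tensor, keep: torch.Tensor, d_ffn: int) -> torch.Tensor:
    """Heuristic baseline (assumed form, not from the paper): rescale the pruned
    projection by the inverse fraction of retained channels so the expected
    layer output magnitude is roughly preserved."""
    scale = d_ffn / keep.numel()          # e.g. 2.0 at 50% pruning
    return W_down[:, keep] * scale
```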
Referee: [Experimental results] The perplexity and accuracy figures summarized in the abstract are reported without error bars, standard deviations across runs, or detailed comparisons against other structured pruning methods. This makes it impossible to determine whether the combination of global gating and calibration delivers a statistically meaningful improvement, or whether the results are sensitive to the particular 512-sequence calibration set.
Authors: We recognize the importance of statistical reporting and broader baselines. Due to the substantial compute required for repeated full pruning runs, we cannot provide standard deviations from multiple independent trials; we will instead add a limitations paragraph stating this constraint and reporting the exact random seed and data order used. We will expand the experiments section with additional structured-pruning baselines (e.g., magnitude-based and other gating methods) evaluated under the same protocol, and we will report results on two additional random 512-sequence calibration subsets to illustrate sensitivity. revision: partial
- Not addressed: providing standard deviations across multiple independent runs of the full GRASPrune procedure, as the computational cost precludes additional trials at this time.
Circularity Check
No significant circularity: empirical pruning method validated on external benchmarks
Full rationale
The paper describes an empirical algorithm: lightweight gates trained via projected straight-through estimator to enforce a global pruning budget on FFN channels and KV heads while freezing backbone weights, followed by calibration of per-unit scaling factors on 512 unlabeled sequences and folding into a dense checkpoint. All reported results (WikiText-2 perplexity, zero-shot accuracies) are measured on standard held-out benchmarks external to the training and calibration data. No derivation step equates a claimed prediction to its own fitted inputs by construction, no self-citation is used as a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The chain is a practical procedure whose success is independently falsifiable on public test sets.
Axiom & Free-Parameter Ledger
free parameters (2)
- global pruning budget
- gate learning hyperparameters
axioms (2)
- Domain assumption: the straight-through estimator provides a usable gradient signal for discrete mask decisions.
- Domain assumption: scale mismatch after pruning can be corrected by a simple multiplicative factor per retained unit.