Prune, Update and Trim: Robust Structured Pruning for Large Language Models

Diego Coello de Portugal Mecke; Lars Schmidt-Thieme; Tom Hanika

arxiv: 2605.18331 · v1 · pith:ZFAUWYSOnew · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 11:58 UTC · model grok-4.3

classification 💻 cs.LG

keywords structured pruning · large language models · post-training pruning · FFN pruning · attention head pruning · extreme sparsity · model compression · grouped-query attention

The pith

Putri prunes LLMs by updating remaining FFN weights after each removal and trimming attention heads one by one, allowing extreme sparsity without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Putri as a post-training pruning approach for large language models that adds three targeted changes to existing methods. It updates the surviving weights in each feed-forward network layer to offset the accuracy loss from removing less important nodes, then prunes the next layer while using those updated values. It also drops individual attention heads instead of entire layers and adapts the process for grouped-query attention models. A reader would care because LLMs are expensive to run at inference time, especially on long inputs or small devices, and effective pruning could shrink them substantially while keeping them usable.

Core claim

Putri works by first identifying and removing lower-impact hidden nodes from an FFN layer, then adjusting the remaining weights in that same layer to reduce the error introduced by the removals. It repeats this process layer by layer so that later pruning decisions reflect the updates already made. For attention, it removes specific heads rather than full layers and extends the same logic to grouped-query attention. These steps let the method reach higher sparsity levels than prior structured pruning techniques while remaining straightforward to implement.
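The prune-then-update step and the layer-by-layer loop described above can be sketched in NumPy. This is an illustration under stated assumptions, not the authors' implementation: the toy FFN uses a ReLU with a residual connection (FFNs in real LLMs are usually gated), the importance score is taken to be the squared L2 norm of each hidden node's activations on a calibration batch, the update is an ordinary least-squares fit of the pruned layer's output to the unpruned output, and the function names `prune_ffn_layer` and `prune_sequentially` are hypothetical.

```python
import numpy as np


def prune_ffn_layer(h, W_in, W_out, keep_ratio):
    """Prune hidden nodes of one toy FFN and refit the surviving output weights.

    h:     (n_samples, d_model) calibration inputs to this layer
    W_in:  (d_model, d_hidden) up-projection
    W_out: (d_hidden, d_model) down-projection
    """
    X = np.maximum(h @ W_in, 0.0)              # hidden activations (toy ReLU FFN)
    n_keep = max(1, int(round(keep_ratio * X.shape[1])))

    # Assumed importance score: squared L2 norm of each node's activations.
    scores = np.sum(X ** 2, axis=0)
    keep = np.sort(np.argsort(scores)[-n_keep:])

    # Least-squares update: argmin_W_hat || X W_out - X_P W_hat ||^2.
    X_P = X[:, keep]
    W_hat, *_ = np.linalg.lstsq(X_P, X @ W_out, rcond=None)
    return W_in[:, keep], W_hat, keep


def prune_sequentially(h, layers, keep_ratio):
    """Prune a stack of toy FFN layers in order.

    Each layer is scored and refit on activations produced by the
    already-pruned layers before it, so later decisions see earlier updates.
    """
    pruned = []
    for W_in, W_out in layers:
        W_in_p, W_out_p, _ = prune_ffn_layer(h, W_in, W_out, keep_ratio)
        pruned.append((W_in_p, W_out_p))
        # Propagate the calibration batch through the pruned layer (with residual).
        h = h + np.maximum(h @ W_in_p, 0.0) @ W_out_p
    return pruned
```

Because the un-updated rows `W_out[keep]` are one feasible candidate for the least-squares problem, the fitted `W_hat` cannot give a larger reconstruction error on the calibration batch than simply dropping rows. How well that carries over to unseen data is exactly the empirical question the paper's ablations address.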

What carries the argument

The combination of per-layer FFN weight updates that compensate for pruning error and sequential processing across layers, paired with individual attention-head removal instead of whole-layer deletion.

If this is right

Models can be reduced to very high sparsity ratios while still producing usable outputs on common evaluation tasks.
Inference cost drops for long-context or resource-limited settings without requiring full retraining.
The same procedure applies across different model sizes and architectures, including those using grouped-query attention.
Pruning decisions become locally adaptive because each layer sees the effects of prior updates.
Attention-head removal provides finer granularity than whole-layer removal, preserving more capacity at the same overall sparsity.
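This page does not spell out how Putri handles grouped-query attention, so the sketch below only illustrates the structural constraint that any head-level pruning faces in a GQA model: a key/value head is shared by a group of query heads and can be dropped only once every query head in its group is gone. The function name `kv_heads_still_needed` is hypothetical, and the layout (query head `i` reads KV head `i // group_size`) is an assumption that matches common implementations but should be checked against the model at hand.

```python
def kv_heads_still_needed(keep_query_heads, num_kv_heads):
    """Return, per KV head, whether any query head in its group survives.

    keep_query_heads: sequence of bools, one per query head (True = kept)
    num_kv_heads:     number of shared key/value heads
    Assumes query heads are grouped consecutively: head i uses KV head i // group_size.
    """
    n_query = len(keep_query_heads)
    if num_kv_heads <= 0 or n_query % num_kv_heads != 0:
        raise ValueError("number of query heads must be a multiple of num_kv_heads")
    group_size = n_query // num_kv_heads
    return [
        any(keep_query_heads[g * group_size:(g + 1) * group_size])
        for g in range(num_kv_heads)
    ]


# 8 query heads sharing 2 KV heads; only query head 0 is kept.
mask = [True, False, False, False, False, False, False, False]
needed = kv_heads_still_needed(mask, num_kv_heads=2)
```

In this example the first KV head must stay because one of its query heads survives, while the second can be removed together with its whole group.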

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The sequential update pattern might reduce sensitivity to the exact ordering of layers compared with one-shot global pruning.
Combining this method with post-pruning quantization could produce further memory and speed gains not explored in the work.
The head-level pruning step could be tested on non-transformer architectures that still contain multi-head attention.
If the compensation updates prove stable, the approach might scale to models with hundreds of layers without drift becoming dominant.

Load-bearing premise

The updates to un-pruned FFN weights after each removal step compensate enough for the introduced error and the sequential order across layers prevents unmanageable accumulation of mistakes.

What would settle it

Measure perplexity or downstream task accuracy on a standard LLM benchmark after applying Putri at an extreme sparsity ratio such as 80 percent or higher; if performance falls below the levels reported for previous methods under identical conditions, the advantage collapses.
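For the perplexity side of that check, the metric itself is simple to compute once per-token log-probabilities are available. The sketch below assumes natural-log probabilities that the pruned model assigned to each reference token; the function name `perplexity` and the input format are illustrative and not taken from the paper's code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_log_probs: natural-log probabilities assigned to each reference token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every reference token probability 0.25.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

Comparing this number for Putri and for a baseline at the same sparsity, calibration data, and evaluation set is the "identical conditions" comparison the paragraph above calls for.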

Figures

Figures reproduced from arXiv: 2605.18331 by Diego Coello de Portugal Mecke, Lars Schmidt-Thieme, Tom Hanika.

**Figure 2.** Ablation study on Qwen3-14B about the individual components of Putri.

**Figure 3.** Ablation study on Qwen3-8B about the individual components of Putri.

Original abstract

Large Language Models (LLMs) have experienced significant growth and development in recent years. However, performing inference on LLMs remains costly, especially for long-context inference or on resource-constrained devices. This motivates the development of new post-training pruning (PTP) methods. These methods reduce LLMs' requirements by removing a substantial part of the model's parameters. The discarded weights are selected depending on their impact on the model's performance. Current PTP methods prune the models by removing the less informative hidden nodes from the FFN layers, and the least important attention layers. We propose Putri, a PTP method that introduces three changes to the state of the art. First, we update the un-pruned weights of the FFN to compensate for the introduced pruning error. Second, the FFN layers are pruned sequentially, taking into account the updates done to the previous layers. Third, instead of removing full attention layers, we remove individual attention heads. We extend this method such that it can also address Grouped-Query Attention. In summary, Putri is a structured pruning method which remains simple while showing SOTA performance. Pruning experiments on multiple models with a wide variety of sparsity ranges and on different datasets validate the generality of Putri. Notably, we demonstrate that, unlike previous methods, Putri can prune LLMs at extreme sparsity ratios. The code is available at: https://github.com/Coello-dev/Putri.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Putri combines weight updates after pruning, sequential FFN steps, and head-level attention removal to reach extreme sparsity in LLMs with claimed SOTA results.

Putri stands out for combining weight compensation updates, sequential FFN pruning, and per-head attention removal to push LLMs to extreme sparsity levels while claiming to hold performance better than prior post-training methods. The new elements are the explicit update step to fix pruning errors in the FFN and the shift to head-level pruning instead of dropping whole attention layers. This trio, plus the extension to grouped-query attention, is what the authors position as the advance over existing PTP approaches.

The paper does well by running tests on multiple models, showing results across a range of sparsity levels, and releasing the code for others to check. The soft spots are fairly minor. The central results depend on those updates compensating effectively, and while the ablations seem to support it, one could wonder whether the gains come partly from extra tuning or specific dataset choices. Also, the added update step during pruning might not be completely free in terms of time, even if the final model is smaller.

This paper is for researchers and engineers working on model compression and efficient inference for large language models. A reader looking for practical ways to reduce LLM size without much retraining would get value from the empirical validation and the simplicity of the approach.

It deserves a serious referee. The work is grounded in experiments that directly test the key ideas, and the code link lowers the barrier to verification. I would recommend putting it through peer review.

Referee Report

0 major / 3 minor

Summary. The paper proposes Putri, a post-training structured pruning method for LLMs. It prunes less informative hidden nodes from FFN layers while updating the remaining un-pruned FFN weights to compensate for introduced error, performs this pruning sequentially across layers while accounting for prior updates, and removes individual attention heads (rather than full layers), with an extension to Grouped-Query Attention. The authors claim SOTA performance across multiple models, a wide range of sparsity ratios (including extreme levels), and different datasets, supported by ablation studies and zero-shot benchmarks. Code is available at https://github.com/Coello-dev/Putri.

Significance. If the empirical results hold, this work could meaningfully advance practical LLM compression by offering a relatively simple structured pruning technique that succeeds at extreme sparsity ratios where prior methods reportedly fail. Strengths include the public GitHub code for reproducibility, ablation tables that directly test the sequential pruning and update components, and zero-shot evaluations across models and sparsity levels. These elements provide a solid empirical foundation for the central claims.

minor comments (3)

[Abstract] The claim of SOTA performance and extreme-sparsity capability would be strengthened by briefly naming the primary metrics (e.g., perplexity or zero-shot accuracy) and models used in the headline results.
[§4, Experiments] Tables reporting main results and ablations should include error bars or standard deviations across multiple random seeds or runs to allow assessment of statistical reliability of the reported gains over baselines.
[§3, Method description] The precise procedure for computing the FFN weight updates (e.g., closed-form solution, iterative solver, or regularization) should be stated explicitly, perhaps with pseudocode, to clarify implementation details even though ablations test the overall effect.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our manuscript and for recommending minor revision. We appreciate the recognition of Putri's potential contribution to practical LLM compression at extreme sparsity levels, as well as the value placed on our public code release and ablation studies.

Circularity Check

0 steps flagged

No significant circularity

The manuscript describes an empirical post-training pruning algorithm (Putri) consisting of three procedural modifications to prior structured pruning baselines: per-layer FFN weight updates to offset pruning error, sequential pruning that incorporates prior-layer updates, and head-level (rather than layer-level) attention pruning. These steps are presented as algorithmic choices whose effectiveness is assessed via ablation tables and zero-shot benchmarks across models and sparsity regimes; no derivation, equation, or first-principles claim is advanced that reduces to a fitted parameter, self-definition, or self-citation chain. The supplied GitHub implementation supplies an independent reproducibility route, confirming that the central performance claims rest on external experimental evidence rather than internal logical closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical performance rather than new axioms or invented entities. No explicit free parameters are named in the abstract, though the method implicitly depends on choices of sparsity schedule and importance metric that are fitted or tuned per model.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

$\mathrm{score}(\mathrm{node}^{(l)}_i) = \lVert z^{(l)}_i \rVert_2^2$ ... $\arg\min_{\hat{W}_P} \lVert XW - X_P \hat{W}_P \rVert_2^2 = (X_P^\top X_P)^{-1} X_P^\top X W$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

We sequentially prune the FFN layers, allowing each pruning decision to account for the perturbations introduced by the layers that were pruned earlier.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

work page 2018

[2] [2]

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

work page 2019

[3] [3]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901

[4] [4]

Schulman, Arnold Milstein, Demetri Terzopoulos, Ade Famoti, Noboru Kuno, Ashley J

Zane Durante, Bidipta Sarkar, Ran Gong, Rohan Taori, Yusuke Noda, Paul Tang, Ehsan Adeli, Shrinidhi Kowshika Lakshmikanth, Kevin Schulman, Arnold Milstein, et al. An interactive agent foundation model.arXiv preprint arXiv:2402.05929, 2024

work page arXiv 2024

[5] [5]

More agents is all you need

junyou li, Qin Zhang, Yangbin Yu, QIANG FU, and Deheng Ye. More agents is all you need. Transactions on Machine Learning Research, 2024

work page 2024

[6] [6]

Os-copilot: Towards generalist computer agents with self-improvement

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. InICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

work page 2024

[7] [7]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[8] [8]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision- language-action models transfer web knowledge to robotic control, 2023.URL https://arxiv. org/abs/2307.15818, 1:2, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[11] [11]

Integer quantization for deep learning inference: Principles and empirical evaluation, 2020

Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: Principles and empirical evaluation, 2020

work page 2020

[12] [12]

Optimal brain damage.Advances in neural information processing systems, 2, 1989

Yann LeCun, John Denker, and Sara Solla. Optimal brain damage.Advances in neural information processing systems, 2, 1989

work page 1989

[13] [13]

To prune, or not to prune: exploring the efficacy of pruning for model compression

Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression.arXiv preprint arXiv:1710.01878, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[14] [14]

Sparsegpt: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. InInternational Conference on Machine Learning, pages 10323–10337. PMLR, 2023

work page 2023

[15] [15]

Besa: Pruning large language models with blockwise parameter-efficient sparsity allocation, 2024

Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, and Ping Luo. Besa: Pruning large language models with blockwise parameter-efficient sparsity allocation, 2024. 10

work page 2024

[16] [16]

Plug-and-play: An efficient post-training pruning method for large language models

Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci. Plug-and-play: An efficient post-training pruning method for large language models. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[17] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library, 2019

[18] Francois Chollet et al. Keras, 2015

[19] NeuralMagic. Deepsparse, 2022

[20] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023

[21] Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024

[22] Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20192–20204, 2025

[23] Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, and Xiaojuan Qi. Slim-llm: Salience-driven mixed-precision quantization for large language models. arXiv preprint arXiv:2405.14917, 2024

[24] Fabrizio Sandri, Elia Cunegatti, and Giovanni Iacca. 2SSP: A two-stage framework for structured pruning of LLMs. Transactions on Machine Learning Research, 2025

[25] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024

[26] Diego Coello de Portugal Mecke, Haya Alyoussef, Maximilian Stubbemann, Ilia Koloiarov, Tom Hanika, and Lars Schmidt-Thieme. Stade: Standard deviation as a pruning metric. arXiv preprint arXiv:2503.22451, 2025

[27] Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts. The unreasonable ineffectiveness of the deeper layers, 2025

[28] Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, and Liangzhi Li. Blockpruner: Fine-grained pruning for large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5065–5080, 2025

[29] Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, and Dan Alistarh. Evopress: Accurate dynamic model compression via evolutionary search. In International Conference on Machine Learning, pages 55556–55590. PMLR, 2025

[30] Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019

[31] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

[32] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[33] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[34] Biao Zhang and Rico Sennrich. Root mean square layer normalization, 2019

[35] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

[36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019

[37] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2022

[38] Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

[39] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 07 2024

[40] Sakaguchi Keisuke, Le Bras Ronan, Bhagavatula Chandra, and Choi Yejin. Winogrande: An adversarial winograd schema challenge at scale. 2019

[41] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018

[42] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

[43] Sandeep Tata and Jignesh M Patel. Piqa: An algebra for querying protein data sets. In 15th International Conference on Scientific and Statistical Database Management, 2003, pages 141–150. IEEE, 2003

[44] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

A Appendix

A.1 Additional results

As mentioned in the main paper, we delegate some of the results to this section due...