pith. sign in

arxiv: 2606.02559 · v1 · pith:EMV2I3QNnew · submitted 2026-06-01 · 💻 cs.CL · cs.AI

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Pith reviewed 2026-06-28 14:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM compressionsubmodule replacementresidual fittingpost-training compressiontransformer sparsityattention feedforwardmodel granularity
0
0 comments X

The pith

SubFit compresses LLMs at the submodule level by replacing non-contiguous Attention and FeedForward components with individual fitted residual bypasses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that redundancy in pretrained transformers is not confined to contiguous regions and does not distribute evenly between Attention and FeedForward outputs, so existing full-layer contiguous replacement methods are overly restrictive. SubFit selects submodules non-contiguously and fits a lightweight residual bypass to each one separately, operating post-training with only calibration data. This produces the best aggregate perplexity-accuracy trade-off across ten models and five sparsity levels, with the largest gains at aggressive compression. A reader would care because the approach reduces model size and inference cost while preserving more capability than layer-level baselines.

Core claim

SubFit moves replacement-based compression from full-layer granularity with contiguous selection to submodule granularity, where Attention and FeedForward submodules are chosen independently across depth and each receives its own fitted residual bypass. Across five base and five instruction-tuned LLMs and sparsity levels from 12.5 percent to 37.5 percent, this yields the best overall perplexity-accuracy trade-off, with larger margins at higher sparsity; at 25 percent sparsity it retains 84.6 percent of dense downstream accuracy at 2.42 times perplexity degradation versus 81.6 percent and 4.34 times for the strongest baselines, plus measurable inference speedup and KV-cache savings.

What carries the argument

Submodule-level Fitted residual replacement (SubFit), which independently selects and approximates non-contiguous Attention and FeedForward submodules with per-submodule lightweight fitted residuals.

If this is right

  • SubFit improves the perplexity-accuracy trade-off at every evaluated sparsity level from 12.5 percent to 37.5 percent.
  • The gains widen under more aggressive compression.
  • The method applies equally to base and instruction-tuned models.
  • It delivers inference speedup and KV-cache reduction as direct by-products of the compression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Optimal replacement strategies likely differ between Attention and FeedForward submodules, suggesting type-specific fitting could be explored further.
  • The non-contiguous selection pattern may transfer to other compression families such as structured pruning or quantization.
  • Hybrid pipelines could combine submodule replacement for some layers with full-layer methods for others.

Load-bearing premise

Redundancy in pretrained transformers is not confined to contiguous regions and does not evenly distribute between Attention and FeedForward outputs.

What would settle it

If a contiguous full-layer replacement method achieved equal or superior perplexity and downstream accuracy to SubFit on the same ten models at the same sparsity levels, the claimed advantage of submodule granularity would be falsified.

Figures

Figures reproduced from arXiv: 2606.02559 by Elia Cunegatti, Erik Nielsen, Giovanni Iacca, Marcus Vukojevic.

Figure 1
Figure 1. Figure 1: Downstream accuracy retention versus aggre [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Workflow of SUBFIT. The blue boxes are relative to the Attention submodules replacement, while the green ones are about the FeedForward submodules. Notation. For a submodule Fℓ at layer ℓ, either Attention or FFN, we denote by x ∈ R d a generic token-level input and by y = Fℓ(x) ∈ R d the corre￾sponding residual contribution to be approximated, where d is the hidden dimension, shared by both input and outp… view at source ↗
Figure 3
Figure 3. Figure 3: Fixed-selection rank sensitivity at 25% spar￾sity. FFN quality improves steadily up to rank 4096, while Attention saturates already at rank 256, confirm￾ing that the two submodule types require different ranks [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SubFit, a post-training replacement-based compression technique for LLMs that selects and replaces individual Attention and FeedForward submodules (rather than full layers) in a non-contiguous manner, each with its own lightweight fitted residual bypass. It evaluates the method on ten LLMs across five sparsity levels (12.5%–37.5%) against four layer-granularity baselines, claiming superior aggregate perplexity-accuracy trade-offs, with specific gains at 25% sparsity (84.6% downstream accuracy retention and 2.42× perplexity degradation vs. 81.6% and 4.34× for the strongest baseline), plus inference and KV-cache benefits.

Significance. If the empirical comparisons hold under matched compression ratios, the result would demonstrate that submodule-level granularity can better exploit non-uniform redundancy in transformers, improving the efficiency frontier for replacement-based compression without requiring retraining. The availability of code supports reproducibility.

major comments (1)
  1. [Experimental setup / sparsity definition (likely §3.2 or §4)] The central empirical claim (abstract and §4) compares SubFit and layer-granularity baselines at identical reported sparsity percentages (e.g., 25%). However, each transformer layer contains two submodules of unequal parameter count; removing 25% of submodules therefore does not produce the same total parameter reduction, FLOPs, or functional impact as removing 25% of layers. The manuscript must define sparsity explicitly (parameter count, FLOPs, or submodule count) and demonstrate that the reported levels are matched on actual compression ratio before the perplexity-accuracy numbers can be directly compared.
minor comments (2)
  1. [Abstract] The abstract states 'measurable inference speedup and KV-cache savings' without numbers; add quantitative values or a reference to the relevant table/figure.
  2. [Method description] Notation for the fitted residual bypass (e.g., its parameter count relative to the replaced submodule) should be introduced once and used consistently in equations and text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental setup. We address the major comment below.

read point-by-point responses
  1. Referee: [Experimental setup / sparsity definition (likely §3.2 or §4)] The central empirical claim (abstract and §4) compares SubFit and layer-granularity baselines at identical reported sparsity percentages (e.g., 25%). However, each transformer layer contains two submodules of unequal parameter count; removing 25% of submodules therefore does not produce the same total parameter reduction, FLOPs, or functional impact as removing 25% of layers. The manuscript must define sparsity explicitly (parameter count, FLOPs, or submodule count) and demonstrate that the reported levels are matched on actual compression ratio before the perplexity-accuracy numbers can be directly compared.

    Authors: We agree that the manuscript requires an explicit definition of sparsity and verification that comparisons occur at matched compression ratios. Sparsity in the current version is defined as the fraction of submodules (SubFit) or layers (baselines) replaced, which does not guarantee identical parameter or FLOP reduction given the unequal sizes of Attention and FeedForward submodules. In the revision we will (1) state this definition clearly in §3, (2) add a table reporting the actual parameter reduction and estimated FLOP reduction for each reported sparsity level under both SubFit and the baselines, and (3) either adjust the sparsity levels or provide side-by-side matched-ratio comparisons so that the perplexity-accuracy results can be interpreted under equivalent compression budgets. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation is self-contained

full rationale

The paper proposes SubFit as a post-training compression technique at submodule granularity and reports direct empirical comparisons of perplexity and downstream accuracy against four replacement-based baselines across ten LLMs and five sparsity levels. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the abstract or described method. The central claims rest on external benchmark results rather than any reduction to the paper's own inputs or prior author work.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The method introduces new fitted components and relies on the domain assumption about redundancy distribution; no machine-checked proofs or external benchmarks mentioned in abstract.

free parameters (1)
  • parameters of fitted residual bypasses
    Lightweight modules fitted to calibration data for each replaced submodule.
axioms (1)
  • domain assumption Redundancy in LLMs is distributed non-contiguously and differently across submodule types
    This is the key intuition stated as the basis for moving beyond full-layer contiguous selection.
invented entities (1)
  • Submodule-level fitted residual bypass no independent evidence
    purpose: To approximate removed submodules at a finer granularity
    Introduced as the core replacement mechanism in SubFit.

pith-pipeline@v0.9.1-grok · 5798 in / 1263 out tokens · 32012 ms · 2026-06-28T14:45:28.377843+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

118 extracted references · 6 canonical work pages

  1. [3]

    Elia Cunegatti and Leonardo Lucio Custode and Giovanni Iacca , year = 2025, journal =

  2. [4]

    Xiaodong Chen and Yuxuan Hu and Jing Zhang and Yanling Wang and Cuiping Li and Hong Chen , year = 2025, booktitle =

  3. [5]

    Shopkhoev, Dmitriy and Ali, Ammar and Zhussip, Magauiya and Malykh, Valentin and Lefkimmiatis, Stamatios and Komodakis, Nikos and Zagoruyko, Sergey , year = 2025, booktitle =

  4. [8]

    Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , year = 2022, booktitle =

  5. [10]

    Frantar, Elias and Alistarh, Dan , year = 2023, booktitle =

  6. [11]

    Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, Zico , year = 2024, booktitle =

  7. [12]

    Ashkboos, Saleh and Croci, Maximilian and Gennari do Nascimento, Marcelo and Hoefler, Torsten and Hensman, James , year = 2024, booktitle =

  8. [13]

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , year = 2017, booktitle =

  9. [15]

    Anderson, Theodore Wilbur , year = 1951, journal =

  10. [16]

    doi:https://doi.org/10.1016/0047-259X(75)90042-1 , issn =

    Alan Julian Izenman , year = 1975, journal =. doi:https://doi.org/10.1016/0047-259X(75)90042-1 , issn =

  11. [17]

    Eckart, Carl and Young, Gale , year = 1936, journal =

  12. [18]

    Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song , year = 2024, journal =

  13. [19]

    Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song , year = 2023, booktitle =

  14. [20]

    Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan , year = 2022, journal =

  15. [21]

    Shao, Wenqi and Chen, Mengzhao and Zhang, Zhaoyang and Xu, Peng and Zhao, Lirui and Li, Zhiqian and Zhang, Kaipeng and Peng, Gao and Qiao, Yu and Luo, Ping , year = 2024, booktitle =

  16. [22]

    Xinyin Ma and Gongfan Fang and Xinchao Wang , year = 2023, booktitle =

  17. [23]

    Fabrizio Sandri and Elia Cunegatti and Giovanni Iacca , year = 2025, journal =

  18. [24]

    Men, Xin and Xu, Mingyu and Zhang, Qingyu and Yuan, Qianhao and Wang, Bingning and Lin, Hongyu and Lu, Yaojie and Han, Xianpei and Chen, Weipeng , year = 2025, booktitle =

  19. [25]

    Yang, Yifei and Cao, Zouying and Zhao, Hai , year = 2024, booktitle =

  20. [26]

    Lui , year = 2026, booktitle =

    Yikun Jiang and Huanyu Wang and Tianhong Ding and Wenhu Zhang and Yiming Wu and Hanbin Zhao and John C.S. Lui , year = 2026, booktitle =

  21. [27]

    An, Yongqi and Zhao, Xu and Yu, Tao and Tang, Ming and Wang, Jinqiao , year = 2024, booktitle =

  22. [28]

    Ash and Dipendra Misra , year = 2024, booktitle =

    Pratyusha Sharma and Jordan T. Ash and Dipendra Misra , year = 2024, booktitle =

  23. [29]

    Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi , year = 2024, booktitle =

  24. [30]

    Liu, Yijiang and Yang, Huanrui and Chen, Youxin and Zhang, Rongyu and Wang, Miao and Du, Yuan and Du, Li , year = 2025, booktitle =

  25. [31]

    Liu, Zechun and Oguz, Barlas and Zhao, Changsheng and Chang, Ernie and Stock, Pierre and Mehdad, Yashar and Shi, Yangyang and Krishnamoorthi, Raghuraman and Chandra, Vikas , year = 2024, booktitle =

  26. [32]

    Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Luo, Ping , year = 2025, booktitle =

  27. [33]

    Tang, Shengkun and Sieberling, Oliver and Kurtic, Eldar and Shen, Zhiqiang and Alistarh, Dan , year = 2025, journal =

  28. [34]

    Liu, Kainan and Zhang, Yong and Cheng, Ning and Li, Zhitao and Wang, Shaojun and Xiao, Jing , year = 2025, booktitle =

  29. [35]

    Zhiqiang Shen and Tianhua Tao and Liqun Ma and Willie Neiswanger and Zhengzhong Liu and Hongyi Wang and Bowen Tan and Joel Hestness and Natalia Vassilieva and Daria Soboleva and Eric Xing , year = 2023, journal =

  30. [36]

    Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah , year = 2023, journal =

  31. [37]

    Sclar, Melanie and Choi, Yejin and Tsvetkov, Yulia and Suhr, Alane , year = 2024, booktitle =

  32. [38]

    Shrestha, Safal and Shrestha, Anubhav and Nepal, Aadim and Kim, Minwu and Ross, Keith , year = 2026, journal =

  33. [40]

    , year = 2020, journal =

    Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , year = 2020, journal =

  34. [41]

    Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard , year = 2016, journal =

  35. [42]

    Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , year = 2019, booktitle =

  36. [43]

    Bisk, Yonatan and Zellers, Rowan and Bras, Ronan Le and Gao, Jianfeng and Choi, Yejin , year = 2020, booktitle =

  37. [44]

    Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , year = 2021, month = aug, journal =

  38. [45]

    Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish , year = 2018, booktitle =

  39. [46]

    Sap, Maarten and Rashkin, Hannah and Chen, Derek and Le Bras, Ronan and Choi, Yejin , year = 2019, booktitle =

  40. [47]

    Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , year = 2019, booktitle =

  41. [48]

    Lin, Stephanie and Hilton, Jacob and Evans, Owain , year = 2022, booktitle =

  42. [49]

    Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , year = 2021, booktitle =

  43. [50]

    Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , year = 2018, journal =

  44. [51]

    Amini, Aida and Gabriel, Saadia and Lin, Shanchuan and Koncel-Kedziorski, Rik and Choi, Yejin and Hajishirzi, Hannaneh , year = 2019, booktitle =

  45. [52]

    Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , year = 2021, journal =

  46. [53]

    and Gardner, Matt , year = 2017, booktitle =

    Welbl, Johannes and Liu, Nelson F. and Gardner, Matt , year = 2017, booktitle =

  47. [54]

    Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

  48. [55]

    Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan , year = 2019, booktitle =

  49. [56]

    Xinrui Chen and Haoli Bai and Tao Yuan and Ruikang Liu and Kang Zhao and Xianzhi Yu and Lu Hou and Tian Guan and Yonghong He and Chun Yuan , year = 2026, booktitle =

  50. [57]

    Dettmers, Tim and Svirschevski, Ruslan and Egiazarian, Vage and Kuznedelev, Denis and Frantar, Elias and Ashkboos, Saleh and Borzunov, Alexander and Hoefler, Torsten and Alistarh, Dan , year = 2023, journal =

  51. [58]

    Mahoney and Kurt Keutzer , year = 2024, booktitle =

    Sehoon Kim and Coleman Richard Charles Hooper and Amir Gholami and Zhen Dong and Xiuyu Li and Sheng Shen and Michael W. Mahoney and Kurt Keutzer , year = 2024, booktitle =

  52. [59]

    Andrey Gromov and Kushal Tirumala and Hassan Shapourian and Paolo Glorioso and Dan Roberts , year = 2025, booktitle =

  53. [60]

    Siddiqui, Shoaib Ahmed and Dong, Xin and Heinrich, Greg and Breuel, Thomas and Kautz, Jan and Krueger, David and Molchanov, Pavlo , year = 2024, booktitle =

  54. [62]

    Shazeer, Noam , year = 2020, journal =

  55. [63]

    Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , year = 2024, journal =

  56. [64]

    Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and others , year = 2025, journal =

  57. [65]

    Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Lu, Keming and others , year = 2024, journal =

  58. [66]

    Bi, Xiao and Chen, Deli and Chen, Guanting and Chen, Shanhuang and Dai, Damai and Deng, Chengqi and Ding, Honghui and Dong, Kai and Du, Qiushi and Fu, Zhe and others , year = 2024, journal =

  59. [67]

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. 2021. https://doi.org/10.18653/v1/2021.acl-long.568 Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V...

  60. [68]

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

  61. [69]

    Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-based adaptive structured pruning for large language models . In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10865--10873

  62. [70]

    Saleh Ashkboos, Maximilian Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/316648eb8b4ffb6010f531b07848c300-Paper-Conference.pdf SliceGPT: Compress Large Language Models by Deleting Rows and Columns . In International Conference on Learning Representations, volume 2024...

  63. [71]

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, and 1 others. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism . arXiv preprint arXiv:2401.02954

  64. [72]

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language . In Proceedings of the AAAI Conference on Artificial Intelligence

  65. [73]

    Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. 2025 a . EfficientQAT: Efficient Quantization-Aware Training for Large Language Models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10081--10100

  66. [74]

    Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. 2025 b . https://openreview.net/forum?id=IC5RJvRoMp Streamlining Redundant Layers to Compress Large Language Models . In International Conference on Learning Representations

  67. [75]

    Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, and Chun Yuan. 2026 a . https://openreview.net/forum?id=AwsiYZ2ets A Simple Linear Patch Revives Layer-Pruned Large Language Models . In Advances in Neural Information Processing Systems

  68. [76]

    Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, and Chun Yuan. 2026 b . https://doi.org/10.1609/aaai.v40i24.39120 Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation . Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):20316–20324

  69. [77]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

  70. [78]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge . arXiv preprint arXiv:1803.05457

  71. [79]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems . arXiv preprint arXiv:2110.14168

  72. [80]

    Elia Cunegatti, Leonardo Lucio Custode, and Giovanni Iacca. 2025. https://openreview.net/forum?id=uPyNaNqFK2 Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training . Transactions on Machine Learning Research

  73. [81]

    Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot . In International Conference on Machine Learning, pages 10323--10337. PMLR

  74. [82]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers . arXiv preprint arXiv:2210.17323

  75. [83]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://zenodo.org/records/12608602 A framework for...

  76. [84]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 Herd of Models . arXiv preprint arXiv:2407.21783

  77. [85]

    Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Dan Roberts. 2025. https://openreview.net/forum?id=ngmEcEer8a The Unreasonable Ineffectiveness of the Deeper Layers . In International Conference on Learning Representations

  78. [86]

    Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, and Yu Takagi. 2025. https://arxiv.org/abs/2506.14681 Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

  79. [87]

    Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li. 2024. https://arxiv.org/abs/2406.15786 What Matters in Transformers? Not All Attention is Needed . Preprint, arXiv:2406.15786

  80. [88]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding . In International Conference on Learning Representations

Showing first 80 references.