From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Elia Cunegatti; Erik Nielsen; Giovanni Iacca; Marcus Vukojevic

arxiv: 2606.02559 · v1 · pith:EMV2I3QNnew · submitted 2026-06-01 · 💻 cs.CL · cs.AI

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Elia Cunegatti , Marcus Vukojevic , Erik Nielsen , Giovanni Iacca This is my paper

Pith reviewed 2026-06-28 14:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords LLM compressionsubmodule replacementresidual fittingpost-training compressiontransformer sparsityattention feedforwardmodel granularity

0 comments

The pith

SubFit compresses LLMs at the submodule level by replacing non-contiguous Attention and FeedForward components with individual fitted residual bypasses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that redundancy in pretrained transformers is not confined to contiguous regions and does not distribute evenly between Attention and FeedForward outputs, so existing full-layer contiguous replacement methods are overly restrictive. SubFit selects submodules non-contiguously and fits a lightweight residual bypass to each one separately, operating post-training with only calibration data. This produces the best aggregate perplexity-accuracy trade-off across ten models and five sparsity levels, with the largest gains at aggressive compression. A reader would care because the approach reduces model size and inference cost while preserving more capability than layer-level baselines.

Core claim

SubFit moves replacement-based compression from full-layer granularity with contiguous selection to submodule granularity, where Attention and FeedForward submodules are chosen independently across depth and each receives its own fitted residual bypass. Across five base and five instruction-tuned LLMs and sparsity levels from 12.5 percent to 37.5 percent, this yields the best overall perplexity-accuracy trade-off, with larger margins at higher sparsity; at 25 percent sparsity it retains 84.6 percent of dense downstream accuracy at 2.42 times perplexity degradation versus 81.6 percent and 4.34 times for the strongest baselines, plus measurable inference speedup and KV-cache savings.

What carries the argument

Submodule-level Fitted residual replacement (SubFit), which independently selects and approximates non-contiguous Attention and FeedForward submodules with per-submodule lightweight fitted residuals.

If this is right

SubFit improves the perplexity-accuracy trade-off at every evaluated sparsity level from 12.5 percent to 37.5 percent.
The gains widen under more aggressive compression.
The method applies equally to base and instruction-tuned models.
It delivers inference speedup and KV-cache reduction as direct by-products of the compression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Optimal replacement strategies likely differ between Attention and FeedForward submodules, suggesting type-specific fitting could be explored further.
The non-contiguous selection pattern may transfer to other compression families such as structured pruning or quantization.
Hybrid pipelines could combine submodule replacement for some layers with full-layer methods for others.

Load-bearing premise

Redundancy in pretrained transformers is not confined to contiguous regions and does not evenly distribute between Attention and FeedForward outputs.

What would settle it

If a contiguous full-layer replacement method achieved equal or superior perplexity and downstream accuracy to SubFit on the same ten models at the same sparsity levels, the claimed advantage of submodule granularity would be falsified.

Figures

Figures reproduced from arXiv: 2606.02559 by Elia Cunegatti, Erik Nielsen, Giovanni Iacca, Marcus Vukojevic.

**Figure 2.** Figure 2: Workflow of SUBFIT. The blue boxes are relative to the Attention submodules replacement, while the green ones are about the FeedForward submodules. Notation. For a submodule Fℓ at layer ℓ, either Attention or FFN, we denote by x ∈ R d a generic token-level input and by y = Fℓ(x) ∈ R d the corresponding residual contribution to be approximated, where d is the hidden dimension, shared by both input and outp… view at source ↗

**Figure 3.** Figure 3: Fixed-selection rank sensitivity at 25% sparsity. FFN quality improves steadily up to rank 4096, while Attention saturates already at rank 256, confirming that the two submodule types require different ranks [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SubFit gets better reported tradeoffs by switching to non-contiguous submodule compression with per-submodule fits, but the sparsity numbers may not line up on actual parameter removal.

read the letter

The one thing to know is that SubFit gets noticeably better perplexity and downstream accuracy by dropping to submodule granularity, picking non-contiguous Attention and FF blocks, and fitting a small residual bypass for each one. At 25% sparsity it keeps more accuracy and hurts perplexity less than the layer baselines.

What is actually new is the move away from full-layer contiguous replacement. The authors argue that redundancy is not uniform and not always in blocks, so they select submodules independently and fit each separately. That design choice is the main contribution beyond the baselines.

The paper does a solid job running the method on ten different LLMs, both base and instruction-tuned, at five sparsity levels, and comparing against four replacement baselines. The code release helps. The gains are larger at higher sparsity, which is where it matters most for practical use.

The main soft spot is the sparsity comparison itself. Removing 25% of submodules is not the same as removing 25% of layers because the two submodules per layer have different parameter counts. If the paper just counts the fraction of removed submodules without matching total parameters removed or FLOPs, then the headline numbers at "25% sparsity" are not directly comparable. That could inflate the apparent advantage. The abstract does not spell out the exact sparsity calculation, so this needs to be verified in the methods section.

Beyond that, the work looks like standard empirical compression research with calibration data and post-training fitting. No obvious circularity.

This is for people who work on model compression techniques for LLMs. Anyone thinking about finer-grained pruning or replacement would find the granularity discussion useful. It is worth sending to peer review because the empirical scope is decent and the code is public, even if the sparsity matching needs scrutiny.

Recommendation: send it for review with a note to check the parameter equivalence in the sparsity definition.

Referee Report

1 major / 2 minor

Summary. The paper introduces SubFit, a post-training replacement-based compression technique for LLMs that selects and replaces individual Attention and FeedForward submodules (rather than full layers) in a non-contiguous manner, each with its own lightweight fitted residual bypass. It evaluates the method on ten LLMs across five sparsity levels (12.5%–37.5%) against four layer-granularity baselines, claiming superior aggregate perplexity-accuracy trade-offs, with specific gains at 25% sparsity (84.6% downstream accuracy retention and 2.42× perplexity degradation vs. 81.6% and 4.34× for the strongest baseline), plus inference and KV-cache benefits.

Significance. If the empirical comparisons hold under matched compression ratios, the result would demonstrate that submodule-level granularity can better exploit non-uniform redundancy in transformers, improving the efficiency frontier for replacement-based compression without requiring retraining. The availability of code supports reproducibility.

major comments (1)

[Experimental setup / sparsity definition (likely §3.2 or §4)] The central empirical claim (abstract and §4) compares SubFit and layer-granularity baselines at identical reported sparsity percentages (e.g., 25%). However, each transformer layer contains two submodules of unequal parameter count; removing 25% of submodules therefore does not produce the same total parameter reduction, FLOPs, or functional impact as removing 25% of layers. The manuscript must define sparsity explicitly (parameter count, FLOPs, or submodule count) and demonstrate that the reported levels are matched on actual compression ratio before the perplexity-accuracy numbers can be directly compared.

minor comments (2)

[Abstract] The abstract states 'measurable inference speedup and KV-cache savings' without numbers; add quantitative values or a reference to the relevant table/figure.
[Method description] Notation for the fitted residual bypass (e.g., its parameter count relative to the replaced submodule) should be introduced once and used consistently in equations and text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental setup. We address the major comment below.

read point-by-point responses

Referee: [Experimental setup / sparsity definition (likely §3.2 or §4)] The central empirical claim (abstract and §4) compares SubFit and layer-granularity baselines at identical reported sparsity percentages (e.g., 25%). However, each transformer layer contains two submodules of unequal parameter count; removing 25% of submodules therefore does not produce the same total parameter reduction, FLOPs, or functional impact as removing 25% of layers. The manuscript must define sparsity explicitly (parameter count, FLOPs, or submodule count) and demonstrate that the reported levels are matched on actual compression ratio before the perplexity-accuracy numbers can be directly compared.

Authors: We agree that the manuscript requires an explicit definition of sparsity and verification that comparisons occur at matched compression ratios. Sparsity in the current version is defined as the fraction of submodules (SubFit) or layers (baselines) replaced, which does not guarantee identical parameter or FLOP reduction given the unequal sizes of Attention and FeedForward submodules. In the revision we will (1) state this definition clearly in §3, (2) add a table reporting the actual parameter reduction and estimated FLOP reduction for each reported sparsity level under both SubFit and the baselines, and (3) either adjust the sparsity levels or provide side-by-side matched-ratio comparisons so that the perplexity-accuracy results can be interpreted under equivalent compression budgets. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation is self-contained

full rationale

The paper proposes SubFit as a post-training compression technique at submodule granularity and reports direct empirical comparisons of perplexity and downstream accuracy against four replacement-based baselines across ten LLMs and five sparsity levels. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the abstract or described method. The central claims rest on external benchmark results rather than any reduction to the paper's own inputs or prior author work.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The method introduces new fitted components and relies on the domain assumption about redundancy distribution; no machine-checked proofs or external benchmarks mentioned in abstract.

free parameters (1)

parameters of fitted residual bypasses
Lightweight modules fitted to calibration data for each replaced submodule.

axioms (1)

domain assumption Redundancy in LLMs is distributed non-contiguously and differently across submodule types
This is the key intuition stated as the basis for moving beyond full-layer contiguous selection.

invented entities (1)

Submodule-level fitted residual bypass no independent evidence
purpose: To approximate removed submodules at a finer granularity
Introduced as the core replacement mechanism in SubFit.

pith-pipeline@v0.9.1-grok · 5798 in / 1263 out tokens · 32012 ms · 2026-06-28T14:45:28.377843+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

118 extracted references · 6 canonical work pages

[3]

Elia Cunegatti and Leonardo Lucio Custode and Giovanni Iacca , year = 2025, journal =

2025
[4]

Xiaodong Chen and Yuxuan Hu and Jing Zhang and Yanling Wang and Cuiping Li and Hong Chen , year = 2025, booktitle =

2025
[5]

Shopkhoev, Dmitriy and Ali, Ammar and Zhussip, Magauiya and Malykh, Valentin and Lefkimmiatis, Stamatios and Komodakis, Nikos and Zagoruyko, Sergey , year = 2025, booktitle =

2025
[8]

Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , year = 2022, booktitle =

2022
[10]

Frantar, Elias and Alistarh, Dan , year = 2023, booktitle =

2023
[11]

Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, Zico , year = 2024, booktitle =

2024
[12]

Ashkboos, Saleh and Croci, Maximilian and Gennari do Nascimento, Marcelo and Hoefler, Torsten and Hensman, James , year = 2024, booktitle =

2024
[13]

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , year = 2017, booktitle =

2017
[15]

Anderson, Theodore Wilbur , year = 1951, journal =

1951
[16]

doi:https://doi.org/10.1016/0047-259X(75)90042-1 , issn =

Alan Julian Izenman , year = 1975, journal =. doi:https://doi.org/10.1016/0047-259X(75)90042-1 , issn =

work page doi:10.1016/0047-259x(75)90042-1 1975
[17]

Eckart, Carl and Young, Gale , year = 1936, journal =

1936
[18]

Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song , year = 2024, journal =

2024
[19]

Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song , year = 2023, booktitle =

2023
[20]

Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan , year = 2022, journal =

2022
[21]

Shao, Wenqi and Chen, Mengzhao and Zhang, Zhaoyang and Xu, Peng and Zhao, Lirui and Li, Zhiqian and Zhang, Kaipeng and Peng, Gao and Qiao, Yu and Luo, Ping , year = 2024, booktitle =

2024
[22]

Xinyin Ma and Gongfan Fang and Xinchao Wang , year = 2023, booktitle =

2023
[23]

Fabrizio Sandri and Elia Cunegatti and Giovanni Iacca , year = 2025, journal =

2025
[24]

Men, Xin and Xu, Mingyu and Zhang, Qingyu and Yuan, Qianhao and Wang, Bingning and Lin, Hongyu and Lu, Yaojie and Han, Xianpei and Chen, Weipeng , year = 2025, booktitle =

2025
[25]

Yang, Yifei and Cao, Zouying and Zhao, Hai , year = 2024, booktitle =

2024
[26]

Lui , year = 2026, booktitle =

Yikun Jiang and Huanyu Wang and Tianhong Ding and Wenhu Zhang and Yiming Wu and Hanbin Zhao and John C.S. Lui , year = 2026, booktitle =

2026
[27]

An, Yongqi and Zhao, Xu and Yu, Tao and Tang, Ming and Wang, Jinqiao , year = 2024, booktitle =

2024
[28]

Ash and Dipendra Misra , year = 2024, booktitle =

Pratyusha Sharma and Jordan T. Ash and Dipendra Misra , year = 2024, booktitle =

2024
[29]

Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi , year = 2024, booktitle =

2024
[30]

Liu, Yijiang and Yang, Huanrui and Chen, Youxin and Zhang, Rongyu and Wang, Miao and Du, Yuan and Du, Li , year = 2025, booktitle =

2025
[31]

Liu, Zechun and Oguz, Barlas and Zhao, Changsheng and Chang, Ernie and Stock, Pierre and Mehdad, Yashar and Shi, Yangyang and Krishnamoorthi, Raghuraman and Chandra, Vikas , year = 2024, booktitle =

2024
[32]

Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Luo, Ping , year = 2025, booktitle =

2025
[33]

Tang, Shengkun and Sieberling, Oliver and Kurtic, Eldar and Shen, Zhiqiang and Alistarh, Dan , year = 2025, journal =

2025
[34]

Liu, Kainan and Zhang, Yong and Cheng, Ning and Li, Zhitao and Wang, Shaojun and Xiao, Jing , year = 2025, booktitle =

2025
[35]

Zhiqiang Shen and Tianhua Tao and Liqun Ma and Willie Neiswanger and Zhengzhong Liu and Hongyi Wang and Bowen Tan and Joel Hestness and Natalia Vassilieva and Daria Soboleva and Eric Xing , year = 2023, journal =

2023
[36]

Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah , year = 2023, journal =

2023
[37]

Sclar, Melanie and Choi, Yejin and Tsvetkov, Yulia and Suhr, Alane , year = 2024, booktitle =

2024
[38]

Shrestha, Safal and Shrestha, Anubhav and Nepal, Aadim and Kim, Minwu and Ross, Keith , year = 2026, journal =

2026
[40]

, year = 2020, journal =

Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , year = 2020, journal =

2020
[41]

Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard , year = 2016, journal =

2016
[42]

Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , year = 2019, booktitle =

2019
[43]

Bisk, Yonatan and Zellers, Rowan and Bras, Ronan Le and Gao, Jianfeng and Choi, Yejin , year = 2020, booktitle =

2020
[44]

Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , year = 2021, month = aug, journal =

2021
[45]

Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish , year = 2018, booktitle =

2018
[46]

Sap, Maarten and Rashkin, Hannah and Chen, Derek and Le Bras, Ronan and Choi, Yejin , year = 2019, booktitle =

2019
[47]

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , year = 2019, booktitle =

2019
[48]

Lin, Stephanie and Hilton, Jacob and Evans, Owain , year = 2022, booktitle =

2022
[49]

Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , year = 2021, booktitle =

2021
[50]

Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , year = 2018, journal =

2018
[51]

Amini, Aida and Gabriel, Saadia and Lin, Shanchuan and Koncel-Kedziorski, Rik and Choi, Yejin and Hajishirzi, Hannaneh , year = 2019, booktitle =

2019
[52]

Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , year = 2021, journal =

2021
[53]

and Gardner, Matt , year = 2017, booktitle =

Welbl, Johannes and Liu, Nelson F. and Gardner, Matt , year = 2017, booktitle =

2017
[54]

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

2024
[55]

Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan , year = 2019, booktitle =

2019
[56]

Xinrui Chen and Haoli Bai and Tao Yuan and Ruikang Liu and Kang Zhao and Xianzhi Yu and Lu Hou and Tian Guan and Yonghong He and Chun Yuan , year = 2026, booktitle =

2026
[57]

Dettmers, Tim and Svirschevski, Ruslan and Egiazarian, Vage and Kuznedelev, Denis and Frantar, Elias and Ashkboos, Saleh and Borzunov, Alexander and Hoefler, Torsten and Alistarh, Dan , year = 2023, journal =

2023
[58]

Mahoney and Kurt Keutzer , year = 2024, booktitle =

Sehoon Kim and Coleman Richard Charles Hooper and Amir Gholami and Zhen Dong and Xiuyu Li and Sheng Shen and Michael W. Mahoney and Kurt Keutzer , year = 2024, booktitle =

2024
[59]

Andrey Gromov and Kushal Tirumala and Hassan Shapourian and Paolo Glorioso and Dan Roberts , year = 2025, booktitle =

2025
[60]

Siddiqui, Shoaib Ahmed and Dong, Xin and Heinrich, Greg and Breuel, Thomas and Kautz, Jan and Krueger, David and Molchanov, Pavlo , year = 2024, booktitle =

2024
[62]

Shazeer, Noam , year = 2020, journal =

2020
[63]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , year = 2024, journal =

2024
[64]

Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and others , year = 2025, journal =

2025
[65]

Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Lu, Keming and others , year = 2024, journal =

2024
[66]

Bi, Xiao and Chen, Deli and Chen, Guanting and Chen, Shanhuang and Dai, Damai and Deng, Chengqi and Ding, Honghui and Dong, Kai and Du, Qiushi and Fu, Zhe and others , year = 2024, journal =

2024
[67]

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. 2021. https://doi.org/10.18653/v1/2021.acl-long.568 Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V...

work page doi:10.18653/v1/2021.acl-long.568 2021
[68]

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2019
[69]

Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-based adaptive structured pruning for large language models . In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10865--10873

2024
[70]

Saleh Ashkboos, Maximilian Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/316648eb8b4ffb6010f531b07848c300-Paper-Conference.pdf SliceGPT: Compress Large Language Models by Deleting Rows and Columns . In International Conference on Learning Representations, volume 2024...

2024
[71]

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, and 1 others. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism . arXiv preprint arXiv:2401.02954

Pith/arXiv arXiv 2024
[72]

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language . In Proceedings of the AAAI Conference on Artificial Intelligence

2020
[73]

Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. 2025 a . EfficientQAT: Efficient Quantization-Aware Training for Large Language Models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10081--10100

2025
[74]

Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. 2025 b . https://openreview.net/forum?id=IC5RJvRoMp Streamlining Redundant Layers to Compress Large Language Models . In International Conference on Learning Representations

2025
[75]

Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, and Chun Yuan. 2026 a . https://openreview.net/forum?id=AwsiYZ2ets A Simple Linear Patch Revives Layer-Pruned Large Language Models . In Advances in Neural Information Processing Systems

2026
[76]

Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, and Chun Yuan. 2026 b . https://doi.org/10.1609/aaai.v40i24.39120 Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation . Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):20316–20324

work page doi:10.1609/aaai.v40i24.39120 2026
[77]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2019
[78]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge . arXiv preprint arXiv:1803.05457

Pith/arXiv arXiv 2018
[79]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems . arXiv preprint arXiv:2110.14168

Pith/arXiv arXiv 2021
[80]

Elia Cunegatti, Leonardo Lucio Custode, and Giovanni Iacca. 2025. https://openreview.net/forum?id=uPyNaNqFK2 Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training . Transactions on Machine Learning Research

2025
[81]

Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot . In International Conference on Machine Learning, pages 10323--10337. PMLR

2023
[82]

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers . arXiv preprint arXiv:2210.17323

Pith/arXiv arXiv 2022
[83]

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://zenodo.org/records/12608602 A framework for...

arXiv 2024
[84]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 Herd of Models . arXiv preprint arXiv:2407.21783

Pith/arXiv arXiv 2024
[85]

Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Dan Roberts. 2025. https://openreview.net/forum?id=ngmEcEer8a The Unreasonable Ineffectiveness of the Deeper Layers . In International Conference on Learning Representations

2025
[86]

Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, and Yu Takagi. 2025. https://arxiv.org/abs/2506.14681 Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

arXiv 2025
[87]

Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li. 2024. https://arxiv.org/abs/2406.15786 What Matters in Transformers? Not All Attention is Needed . Preprint, arXiv:2406.15786

arXiv 2024
[88]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding . In International Conference on Learning Representations

2021

Showing first 80 references.

[1] [3]

Elia Cunegatti and Leonardo Lucio Custode and Giovanni Iacca , year = 2025, journal =

2025

[2] [4]

Xiaodong Chen and Yuxuan Hu and Jing Zhang and Yanling Wang and Cuiping Li and Hong Chen , year = 2025, booktitle =

2025

[3] [5]

Shopkhoev, Dmitriy and Ali, Ammar and Zhussip, Magauiya and Malykh, Valentin and Lefkimmiatis, Stamatios and Komodakis, Nikos and Zagoruyko, Sergey , year = 2025, booktitle =

2025

[4] [8]

Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , year = 2022, booktitle =

2022

[5] [10]

Frantar, Elias and Alistarh, Dan , year = 2023, booktitle =

2023

[6] [11]

Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, Zico , year = 2024, booktitle =

2024

[7] [12]

Ashkboos, Saleh and Croci, Maximilian and Gennari do Nascimento, Marcelo and Hoefler, Torsten and Hensman, James , year = 2024, booktitle =

2024

[8] [13]

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , year = 2017, booktitle =

2017

[9] [15]

Anderson, Theodore Wilbur , year = 1951, journal =

1951

[10] [16]

doi:https://doi.org/10.1016/0047-259X(75)90042-1 , issn =

Alan Julian Izenman , year = 1975, journal =. doi:https://doi.org/10.1016/0047-259X(75)90042-1 , issn =

work page doi:10.1016/0047-259x(75)90042-1 1975

[11] [17]

Eckart, Carl and Young, Gale , year = 1936, journal =

1936

[12] [18]

Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song , year = 2024, journal =

2024

[13] [19]

Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song , year = 2023, booktitle =

2023

[14] [20]

Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan , year = 2022, journal =

2022

[15] [21]

Shao, Wenqi and Chen, Mengzhao and Zhang, Zhaoyang and Xu, Peng and Zhao, Lirui and Li, Zhiqian and Zhang, Kaipeng and Peng, Gao and Qiao, Yu and Luo, Ping , year = 2024, booktitle =

2024

[16] [22]

Xinyin Ma and Gongfan Fang and Xinchao Wang , year = 2023, booktitle =

2023

[17] [23]

Fabrizio Sandri and Elia Cunegatti and Giovanni Iacca , year = 2025, journal =

2025

[18] [24]

Men, Xin and Xu, Mingyu and Zhang, Qingyu and Yuan, Qianhao and Wang, Bingning and Lin, Hongyu and Lu, Yaojie and Han, Xianpei and Chen, Weipeng , year = 2025, booktitle =

2025

[19] [25]

Yang, Yifei and Cao, Zouying and Zhao, Hai , year = 2024, booktitle =

2024

[20] [26]

Lui , year = 2026, booktitle =

Yikun Jiang and Huanyu Wang and Tianhong Ding and Wenhu Zhang and Yiming Wu and Hanbin Zhao and John C.S. Lui , year = 2026, booktitle =

2026

[21] [27]

An, Yongqi and Zhao, Xu and Yu, Tao and Tang, Ming and Wang, Jinqiao , year = 2024, booktitle =

2024

[22] [28]

Ash and Dipendra Misra , year = 2024, booktitle =

Pratyusha Sharma and Jordan T. Ash and Dipendra Misra , year = 2024, booktitle =

2024

[23] [29]

Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi , year = 2024, booktitle =

2024

[24] [30]

Liu, Yijiang and Yang, Huanrui and Chen, Youxin and Zhang, Rongyu and Wang, Miao and Du, Yuan and Du, Li , year = 2025, booktitle =

2025

[25] [31]

Liu, Zechun and Oguz, Barlas and Zhao, Changsheng and Chang, Ernie and Stock, Pierre and Mehdad, Yashar and Shi, Yangyang and Krishnamoorthi, Raghuraman and Chandra, Vikas , year = 2024, booktitle =

2024

[26] [32]

Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Luo, Ping , year = 2025, booktitle =

2025

[27] [33]

Tang, Shengkun and Sieberling, Oliver and Kurtic, Eldar and Shen, Zhiqiang and Alistarh, Dan , year = 2025, journal =

2025

[28] [34]

Liu, Kainan and Zhang, Yong and Cheng, Ning and Li, Zhitao and Wang, Shaojun and Xiao, Jing , year = 2025, booktitle =

2025

[29] [35]

Zhiqiang Shen and Tianhua Tao and Liqun Ma and Willie Neiswanger and Zhengzhong Liu and Hongyi Wang and Bowen Tan and Joel Hestness and Natalia Vassilieva and Daria Soboleva and Eric Xing , year = 2023, journal =

2023

[30] [36]

Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah , year = 2023, journal =

2023

[31] [37]

Sclar, Melanie and Choi, Yejin and Tsvetkov, Yulia and Suhr, Alane , year = 2024, booktitle =

2024

[32] [38]

Shrestha, Safal and Shrestha, Anubhav and Nepal, Aadim and Kim, Minwu and Ross, Keith , year = 2026, journal =

2026

[33] [40]

, year = 2020, journal =

Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , year = 2020, journal =

2020

[34] [41]

Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard , year = 2016, journal =

2016

[35] [42]

Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , year = 2019, booktitle =

2019

[36] [43]

Bisk, Yonatan and Zellers, Rowan and Bras, Ronan Le and Gao, Jianfeng and Choi, Yejin , year = 2020, booktitle =

2020

[37] [44]

Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , year = 2021, month = aug, journal =

2021

[38] [45]

Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish , year = 2018, booktitle =

2018

[39] [46]

Sap, Maarten and Rashkin, Hannah and Chen, Derek and Le Bras, Ronan and Choi, Yejin , year = 2019, booktitle =

2019

[40] [47]

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , year = 2019, booktitle =

2019

[41] [48]

Lin, Stephanie and Hilton, Jacob and Evans, Owain , year = 2022, booktitle =

2022

[42] [49]

Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , year = 2021, booktitle =

2021

[43] [50]

Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , year = 2018, journal =

2018

[44] [51]

Amini, Aida and Gabriel, Saadia and Lin, Shanchuan and Koncel-Kedziorski, Rik and Choi, Yejin and Hajishirzi, Hannaneh , year = 2019, booktitle =

2019

[45] [52]

Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , year = 2021, journal =

2021

[46] [53]

and Gardner, Matt , year = 2017, booktitle =

Welbl, Johannes and Liu, Nelson F. and Gardner, Matt , year = 2017, booktitle =

2017

[47] [54]

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

2024

[48] [55]

Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan , year = 2019, booktitle =

2019

[49] [56]

Xinrui Chen and Haoli Bai and Tao Yuan and Ruikang Liu and Kang Zhao and Xianzhi Yu and Lu Hou and Tian Guan and Yonghong He and Chun Yuan , year = 2026, booktitle =

2026

[50] [57]

Dettmers, Tim and Svirschevski, Ruslan and Egiazarian, Vage and Kuznedelev, Denis and Frantar, Elias and Ashkboos, Saleh and Borzunov, Alexander and Hoefler, Torsten and Alistarh, Dan , year = 2023, journal =

2023

[51] [58]

Mahoney and Kurt Keutzer , year = 2024, booktitle =

Sehoon Kim and Coleman Richard Charles Hooper and Amir Gholami and Zhen Dong and Xiuyu Li and Sheng Shen and Michael W. Mahoney and Kurt Keutzer , year = 2024, booktitle =

2024

[52] [59]

Andrey Gromov and Kushal Tirumala and Hassan Shapourian and Paolo Glorioso and Dan Roberts , year = 2025, booktitle =

2025

[53] [60]

Siddiqui, Shoaib Ahmed and Dong, Xin and Heinrich, Greg and Breuel, Thomas and Kautz, Jan and Krueger, David and Molchanov, Pavlo , year = 2024, booktitle =

2024

[54] [62]

Shazeer, Noam , year = 2020, journal =

2020

[55] [63]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , year = 2024, journal =

2024

[56] [64]

Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and others , year = 2025, journal =

2025

[57] [65]

Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Lu, Keming and others , year = 2024, journal =

2024

[58] [66]

Bi, Xiao and Chen, Deli and Chen, Guanting and Chen, Shanhuang and Dai, Damai and Deng, Chengqi and Ding, Honghui and Dong, Kai and Du, Qiushi and Fu, Zhe and others , year = 2024, journal =

2024

[59] [67]

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. 2021. https://doi.org/10.18653/v1/2021.acl-long.568 Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V...

work page doi:10.18653/v1/2021.acl-long.568 2021

[60] [68]

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2019

[61] [69]

Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-based adaptive structured pruning for large language models . In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10865--10873

2024

[62] [70]

Saleh Ashkboos, Maximilian Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/316648eb8b4ffb6010f531b07848c300-Paper-Conference.pdf SliceGPT: Compress Large Language Models by Deleting Rows and Columns . In International Conference on Learning Representations, volume 2024...

2024

[63] [71]

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, and 1 others. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism . arXiv preprint arXiv:2401.02954

Pith/arXiv arXiv 2024

[64] [72]

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language . In Proceedings of the AAAI Conference on Artificial Intelligence

2020

[65] [73]

Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. 2025 a . EfficientQAT: Efficient Quantization-Aware Training for Large Language Models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10081--10100

2025

[66] [74]

Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. 2025 b . https://openreview.net/forum?id=IC5RJvRoMp Streamlining Redundant Layers to Compress Large Language Models . In International Conference on Learning Representations

2025

[67] [75]

Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, and Chun Yuan. 2026 a . https://openreview.net/forum?id=AwsiYZ2ets A Simple Linear Patch Revives Layer-Pruned Large Language Models . In Advances in Neural Information Processing Systems

2026

[68] [76]

Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, and Chun Yuan. 2026 b . https://doi.org/10.1609/aaai.v40i24.39120 Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation . Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):20316–20324

work page doi:10.1609/aaai.v40i24.39120 2026

[69] [77]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2019

[70] [78]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge . arXiv preprint arXiv:1803.05457

Pith/arXiv arXiv 2018

[71] [79]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems . arXiv preprint arXiv:2110.14168

Pith/arXiv arXiv 2021

[72] [80]

Elia Cunegatti, Leonardo Lucio Custode, and Giovanni Iacca. 2025. https://openreview.net/forum?id=uPyNaNqFK2 Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training . Transactions on Machine Learning Research

2025

[73] [81]

Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot . In International Conference on Machine Learning, pages 10323--10337. PMLR

2023

[74] [82]

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers . arXiv preprint arXiv:2210.17323

Pith/arXiv arXiv 2022

[75] [83]

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://zenodo.org/records/12608602 A framework for...

arXiv 2024

[76] [84]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 Herd of Models . arXiv preprint arXiv:2407.21783

Pith/arXiv arXiv 2024

[77] [85]

Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Dan Roberts. 2025. https://openreview.net/forum?id=ngmEcEer8a The Unreasonable Ineffectiveness of the Deeper Layers . In International Conference on Learning Representations

2025

[78] [86]

Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, and Yu Takagi. 2025. https://arxiv.org/abs/2506.14681 Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

arXiv 2025

[79] [87]

Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li. 2024. https://arxiv.org/abs/2406.15786 What Matters in Transformers? Not All Attention is Needed . Preprint, arXiv:2406.15786

arXiv 2024

[80] [88]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding . In International Conference on Learning Representations

2021