Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

Junhyuk Jo; Sai Praneeth Karimireddy; Sunwoo Lee; Vincent-Daniel Yun

arxiv: 2605.15491 · v1 · pith:FVQMEHJWnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI · cs.PF


Pith reviewed 2026-05-19 16:10 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.PF

keywords layer pruning · LLM compression · activation alignment · training-free recovery · linear operator · boundary mismatch · Transformer decoder · calibration set


The pith

A closed-form linear operator derived from calibration data can reconstruct the hidden-state mismatch caused by removing entire layers from large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that layer pruning creates a predictable activation discrepancy between the output of one surviving layer and the input expected by the next. By solving for the single best linear transformation that minimizes this discrepancy on a small calibration set, the authors recover most of the lost performance without any retraining. This works because the derived operator is the mathematically unconstrained optimum of the alignment objective, whereas earlier recovery methods were forced to search inside smaller families of possible transformations. If the claim holds, layer pruning becomes a more reliable way to shrink and speed up large models while preserving accuracy on downstream tasks.

Core claim

What carries the argument

The closed-form optimal linear operator for boundary activation alignment, obtained by solving the least-squares problem on a calibration set to minimize the difference between pruned and original activations.
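A minimal sketch of such a least-squares fit, assuming a row-per-token layout and synthetic data in place of real calibration activations. The function name `fit_alignment_operator` and the variables `X`, `Y`, `W_true` are hypothetical and do not come from the paper.

```python
import numpy as np

def fit_alignment_operator(X, Y):
    """Return the W that minimizes ||X @ W - Y||_F.

    X: (N, C) hidden states the next surviving layer receives after pruning.
    Y: (N, C) hidden states it would have received in the unpruned model.
    Rows are calibration tokens (a layout assumed for this sketch).
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Synthetic stand-in for calibration activations (not real model data).
rng = np.random.default_rng(0)
N, C = 512, 8
X = rng.normal(size=(N, C))
W_true = np.eye(C) + 0.1 * rng.normal(size=(C, C))
Y = X @ W_true  # an exactly linear mismatch, by construction

W = fit_alignment_operator(X, Y)
residual = np.linalg.norm(X @ W - Y)
```

Fitting the discrepancy `Y - X` with an operator `M` instead gives the same result up to an identity shift (`W = I + M`), so the two formulations are interchangeable for an unconstrained fit.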

If this is right

The method yields higher accuracy and lower perplexity than prior training-free recovery techniques on multiple LLM families and pruning strategies.
The efficiency gains from layer pruning, such as reduced inference latency and memory use, remain intact because the added operator is a single matrix multiplication.
Because the solution is the true unconstrained optimum rather than an approximation inside a restricted subspace, further improvements would require changing the objective itself rather than searching harder within the same family.
The approach is training-free and uses only a small calibration set, so it can be applied after any pruning decision without additional optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the discrepancy introduced by pruning turns out to be largely linear, similar closed-form operators might correct other compression artifacts such as those from low-rank adaptation or early-exit mechanisms.
The calibration-set requirement suggests that periodically refitting the operator on recent user data could keep recovery quality high when the input distribution shifts over time.
Because the operator is derived once and then fixed, it could be fused into the adjacent layers at deployment time to eliminate any extra runtime cost beyond the original pruning savings.
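The refitting idea in the second extension above could be sketched by keeping running sums of the normal-equation terms, so that new data updates the operator without revisiting old activations. This is an illustration under assumptions: the class `RunningAlignmentFit`, its `ridge` damping term, and the synthetic batches are all hypothetical and not part of the paper.

```python
import numpy as np

class RunningAlignmentFit:
    """Accumulates X^T X and X^T Y so the operator can be re-solved cheaply."""

    def __init__(self, dim, ridge=1e-6):
        self.gram = np.zeros((dim, dim))   # running sum of X^T X
        self.cross = np.zeros((dim, dim))  # running sum of X^T Y
        self.ridge = ridge                 # hypothetical damping for stability

    def update(self, X, Y):
        # X, Y: (n_tokens, dim) activation pairs from a new batch.
        self.gram += X.T @ X
        self.cross += X.T @ Y

    def solve(self):
        dim = self.gram.shape[0]
        return np.linalg.solve(self.gram + self.ridge * np.eye(dim), self.cross)

# Synthetic batches standing in for activations collected over time.
rng = np.random.default_rng(1)
C = 8
W_true = np.eye(C) + 0.05 * rng.normal(size=(C, C))
fit = RunningAlignmentFit(C, ridge=0.0)
batches = []
for _ in range(4):
    X = rng.normal(size=(128, C))
    Y = X @ W_true
    fit.update(X, Y)
    batches.append((X, Y))

W_stream = fit.solve()

# Reference: one-shot least squares on all batches stacked together.
X_all = np.vstack([b[0] for b in batches])
Y_all = np.vstack([b[1] for b in batches])
W_batch, *_ = np.linalg.lstsq(X_all, Y_all, rcond=None)
```

With `ridge=0.0` and a well-conditioned Gram matrix, the accumulated solve agrees with the one-shot fit; a small positive `ridge` trades exactness for robustness when few tokens have been seen.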

Load-bearing premise

The activation discrepancy caused by removed layers can be accurately captured and reversed by one linear transformation that was fitted on limited calibration examples and then works for every input the model will see later.

What would settle it

If the linear operator fitted on the calibration set produces no measurable reduction in activation mismatch and no gain in perplexity or accuracy on a large, held-out set of diverse inputs, the claim that the unconstrained optimum yields effective recovery would be falsified.
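The activation-mismatch half of that test could look roughly like the sketch below: fit on a small calibration split, then compare boundary error on a larger held-out split with and without the operator. The data is synthetic and linear by construction, so it only shows the mechanics of the check; real activations need not behave this way. The helper names `boundary_mae` and `sample` are hypothetical.

```python
import numpy as np

def boundary_mae(X, Y, W=None):
    """Mean absolute error between expected Y and the (optionally repaired) X."""
    received = X if W is None else X @ W
    return float(np.mean(np.abs(received - Y)))

rng = np.random.default_rng(2)
C = 16
W_true = np.eye(C) + 0.1 * rng.normal(size=(C, C))

def sample(n):
    # Synthetic activations: a linear mismatch plus small noise (an assumption).
    X = rng.normal(size=(n, C))
    Y = X @ W_true + 0.01 * rng.normal(size=(n, C))
    return X, Y

X_cal, Y_cal = sample(200)      # small calibration set
X_test, Y_test = sample(2000)   # larger held-out set

W, *_ = np.linalg.lstsq(X_cal, Y_cal, rcond=None)
mae_unrepaired = boundary_mae(X_test, Y_test)
mae_repaired = boundary_mae(X_test, Y_test, W)
print(f"held-out MAE without operator: {mae_unrepaired:.4f}")
print(f"held-out MAE with operator:    {mae_repaired:.4f}")
```

On real models the held-out split would be drawn from inputs that differ from the calibration data, and the comparison would be repeated for perplexity and task accuracy.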

Figures

Figures reproduced from arXiv: 2605.15491 by Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee, Vincent-Daniel Yun.

**Figure 1.** Mean absolute error between the expected boundary activation and the activation received by downstream …

**Figure 2.** Ghosted Layers as drop-in replacements for pruned transformer blocks. One or more consecutive transformer …

**Figure 3.** Frobenius norm decomposition of M∗ into symmetric and anti-symmetric components across two LLM backbones (n = 7, LLM-Streamline). Detailed setups are in Appendix A.2. Constrained solution space. As established in Theorem 4.1, W∗ is the unconstrained minimizer over all of R^(C×C), whereas any symmetric operator W satisfies W − W⊤ = 0 and is thus confined to the symmetric subspace. To empirically verify that W…

**Figure 4.** Per-channel mean absolute error (MAE) between the repaired boundary activation …

**Figure 5.** Average accuracy across 9 commonsense reasoning benchmarks with LLaMA-3.1-8B. Efficiency Comparison …
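The decomposition named in the Figure 3 caption can be illustrated with a short sketch. Any square matrix splits into a symmetric and an anti-symmetric part that are orthogonal under the Frobenius inner product, so their squared norms sum to the squared norm of the whole. The function names and the random matrix below are hypothetical; the paper's M∗ is not reproduced here.

```python
import numpy as np

def symmetric_split(M):
    """Split M into symmetric S and anti-symmetric A with M = S + A."""
    S = 0.5 * (M + M.T)
    A = 0.5 * (M - M.T)
    return S, A

def energy_fractions(M):
    """Share of squared Frobenius norm carried by each component."""
    S, A = symmetric_split(M)
    total = np.linalg.norm(M, "fro") ** 2
    return (np.linalg.norm(S, "fro") ** 2 / total,
            np.linalg.norm(A, "fro") ** 2 / total)

# Random stand-in for a fitted operator (not the paper's M*).
rng = np.random.default_rng(3)
M = rng.normal(size=(6, 6))
S, A = symmetric_split(M)
sym_frac, anti_frac = energy_fractions(M)
print(f"symmetric share: {sym_frac:.3f}, anti-symmetric share: {anti_frac:.3f}")
```

A non-negligible anti-symmetric share is the kind of evidence that would indicate a symmetric-only operator family cannot represent the fitted solution.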

Original abstract

Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The closed-form linear operator for activation alignment after layer pruning is a clean incremental step that beats prior training-free baselines in the reported tests, but the generalization claim rests on unexamined distribution shift.


The main takeaway is that this paper derives a closed-form linear operator from a small calibration set to fix the activation mismatch created by removing entire Transformer layers. It positions this as the unconstrained optimum of the alignment objective, unlike earlier methods that stayed inside restricted subspaces, and the experiments show consistent gains in accuracy and perplexity across a few LLM backbones and pruning setups while keeping the inference speedups intact. That part is useful for anyone who wants to prune layers without retraining. The approach stays training-free and the math is presented as a direct least-squares-style solution, which is straightforward to implement if the details hold up.

What the work does well is deliver measurable recovery at the inference cost of a single added matrix multiplication, and the results appear to improve on the training-free baselines they compare against. The soft spots sit mostly in the missing pieces around robustness. The central claim depends on the fitted linear map staying effective when activations shift across the full input distribution at inference, yet the abstract gives no bounds, no sensitivity analysis on the calibration set, and no error bars on the gains. Without those, it is hard to know whether the improvement is reliable or mostly tied to the particular calibration samples. The derivation is called closed-form and optimal, but the lack of explicit steps leaves open whether any downstream-task dependence sneaks in. Citation patterns look standard for this area and do not seem inflated.

This is the kind of paper that would interest people working on practical LLM compression and deployment. A reader focused on post-pruning recovery methods would find the specific unconstrained formulation and the empirical comparisons worth looking at. It has enough of a concrete idea and reported improvement to go to serious referees rather than a desk reject, even though the generalization question will need addressing in revision. I would send it out for review.

Referee Report

3 major / 2 minor

Summary. The paper claims that layer pruning in LLMs creates a boundary activation mismatch that can be recovered training-free by Ghosted Layers: a closed-form optimal linear operator fitted on a small calibration set to reconstruct the pruned-layer activation discrepancy. It asserts this operator is the unconstrained optimum of the alignment objective (unlike prior methods limited to constrained subspaces), and reports consistent accuracy and perplexity gains over baselines across multiple LLM backbones and pruning strategies.

Significance. If the linear operator generalizes beyond the calibration distribution and the closed-form derivation is independent of downstream task loss, the approach would offer a lightweight, training-free way to mitigate pruning-induced degradation while retaining the efficiency benefits of layer removal. The parameter-free character of the claimed optimum would be a notable strength for reproducibility.

major comments (3)

[Abstract / boundary activation alignment problem] Abstract and boundary activation alignment paragraph: the claim that a single linear operator fitted on a small calibration set reconstructs the discrepancy and remains effective across the full inference distribution lacks any robustness argument or bound on distribution shift; subsequent nonlinear Transformer layers can alter the required mapping, and no explicit test of this assumption is supplied.
[Abstract] Abstract: the statement that the solution 'corresponds to the unconstrained optimum' is presented without derivation details, equations, or a proof that the operator is independent of the downstream task loss; without these, it is unclear whether the closed-form reduces to an empirical fit on the calibration data rather than a true unconstrained optimum.
[Experiments] Experiments section: no error bars, standard deviations, or details on calibration-set selection and size are provided, making it impossible to assess whether the reported consistent gains are statistically reliable or sensitive to the choice of calibration data.

minor comments (2)

The manuscript should include a clear statement of the exact least-squares objective and the resulting closed-form expression for the linear operator (presumably W = Y X^+ or equivalent) so readers can verify the unconstrained claim.
Figure and table captions would benefit from explicit mention of the calibration-set size and the pruning ratios tested to improve reproducibility.
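The objective and closed form that the first minor comment asks for can be written out as follows. This is a sketch in assumed notation, not the paper's own statement: the columns of X and Y are calibration tokens, X holds the hidden states entering the boundary after pruning, and Y holds the hidden states the unpruned model would supply. The paper's layout may differ or include extra terms.

```latex
\begin{aligned}
W^\ast &= \arg\min_{W \in \mathbb{R}^{C \times C}} \; \lVert W X - Y \rVert_F^2,
  \qquad X, Y \in \mathbb{R}^{C \times N}, \\
0 &= \nabla_W \lVert W X - Y \rVert_F^2 = 2\,(W X - Y)\,X^\top
  \;\Longrightarrow\; W^\ast X X^\top = Y X^\top, \\
W^\ast &= Y X^\top \left( X X^\top \right)^{-1} = Y X^{+}
  \qquad \text{(when } X X^\top \text{ is invertible).}
\end{aligned}
```

When X Xᵀ is singular, Y X⁺ with the Moore-Penrose pseudoinverse is still a minimizer (the minimum-norm one). Only the activation pairs appear in the objective, so under this formulation the operator depends on no downstream task loss.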

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have addressed each major comment below and revised the paper accordingly to improve clarity, rigor, and reproducibility.


Referee: [Abstract / boundary activation alignment problem] Abstract and boundary activation alignment paragraph: the claim that a single linear operator fitted on a small calibration set reconstructs the discrepancy and remains effective across the full inference distribution lacks any robustness argument or bound on distribution shift; subsequent nonlinear Transformer layers can alter the required mapping, and no explicit test of this assumption is supplied.

Authors: We agree that a formal robustness bound or theoretical analysis of distribution shift would strengthen the claims. The linear operator is derived to minimize the immediate boundary mismatch on the calibration set, and while subsequent nonlinear layers can in principle modify the effective mapping, the alignment is applied precisely at the interface to reduce propagation of the discrepancy. Our experiments already test generalization across multiple models, pruning ratios, and evaluation datasets that differ from the calibration distribution. In the revision we have added a dedicated paragraph in Section 3.2 discussing the modeling assumptions and limitations, and we include new experiments evaluating performance on out-of-distribution prompts to provide more explicit empirical support for the assumption. revision: partial
Referee: [Abstract] Abstract: the statement that the solution 'corresponds to the unconstrained optimum' is presented without derivation details, equations, or a proof that the operator is independent of the downstream task loss; without these, it is unclear whether the closed-form reduces to an empirical fit on the calibration data rather than a true unconstrained optimum.

Authors: We appreciate this observation. The full derivation appears in Section 3, where we formulate the alignment objective as an unconstrained least-squares problem over the linear operator and obtain the closed-form solution via the normal equations; this solution depends only on the observed activation pairs from the calibration set and contains no dependence on any downstream task loss. To address the referee's concern we have expanded the abstract to include a brief reference to the derivation and added a pointer to the relevant equations (Eqs. 3–6) so that readers can immediately locate the proof that the operator is the unconstrained optimum. revision: yes
Referee: [Experiments] Experiments section: no error bars, standard deviations, or details on calibration-set selection and size are provided, making it impossible to assess whether the reported consistent gains are statistically reliable or sensitive to the choice of calibration data.

Authors: We thank the referee for noting this gap in reporting. The revised Experiments section now reports mean performance together with standard deviations and error bars computed over five independent random seeds for every metric and model. We have also added a new paragraph detailing the calibration-set construction: for each experiment we randomly sample 256 sequences (each of length 512 tokens) from the training split of the respective dataset, with the random seed fixed for reproducibility; sensitivity to calibration-set size is additionally explored in an appendix table. revision: yes
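The kind of seed-averaged sensitivity check described in this response could be organized as in the sketch below. It uses synthetic, linear-plus-noise activations and toy calibration sizes counted in tokens, so the numbers say nothing about the paper's actual results. The function `heldout_mse` and all parameter values are hypothetical.

```python
import numpy as np

def heldout_mse(n_cal, seed, C=16, noise=0.05, n_test=2000):
    """Fit on n_cal synthetic tokens, return mean squared error on held-out tokens."""
    rng = np.random.default_rng(seed)
    W_true = np.eye(C) + 0.1 * rng.normal(size=(C, C))
    X_cal = rng.normal(size=(n_cal, C))
    Y_cal = X_cal @ W_true + noise * rng.normal(size=(n_cal, C))
    X_test = rng.normal(size=(n_test, C))
    Y_test = X_test @ W_true + noise * rng.normal(size=(n_test, C))
    W, *_ = np.linalg.lstsq(X_cal, Y_cal, rcond=None)
    return float(np.mean((X_test @ W - Y_test) ** 2))

sizes = [32, 64, 256]   # toy calibration sizes, in tokens (an assumption)
seeds = range(5)        # five independent seeds, as in the response above
summary = {}
for n in sizes:
    errs = np.array([heldout_mse(n, s) for s in seeds])
    summary[n] = (errs.mean(), errs.std(ddof=1))
    print(f"n_cal={n:4d}  held-out MSE = {summary[n][0]:.5f} ± {summary[n][1]:.5f}")
```

Reporting the mean together with the sample standard deviation per calibration size makes it visible whether gains are stable or driven by a particular draw of calibration data.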

Circularity Check

0 steps flagged

No significant circularity; derivation is a standard closed-form solution to an explicitly stated alignment objective

full rationale

The paper defines a boundary activation alignment objective and derives its unconstrained optimum as a closed-form linear operator fitted on calibration activations. This is a direct mathematical solution to the stated minimization problem rather than a reduction of the claimed result to its own inputs by construction. No self-citations are invoked as load-bearing premises, no uniqueness theorems are imported from prior author work, and no fitted parameter is relabeled as an independent prediction. The central claim remains that the derived operator is unconstrained (in contrast to prior constrained subspaces), which follows from the problem formulation itself without tautology. Generalization from calibration to inference is an empirical assumption but does not render the derivation chain circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that activation mismatch after pruning is well-approximated by a linear transformation recoverable from limited calibration data. No explicit free parameters, axioms, or invented entities are named in the abstract, but the linear-operator assumption functions as an unstated domain assumption.

axioms (1)

domain assumption: The activation discrepancy at layer boundaries after pruning is reconstructible by a linear operator derived from a small calibration set.
Invoked in the description of the boundary activation alignment problem and the closed-form solution.




[1] [1]

Fluctuation-based adaptive structured pruning for large language models

Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based adaptive structured pruning for large language models. InProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 2024

work page 2024

[2] [2]

Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[3] [3]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott...

work page 1901

[4] [4]

Streamlining redundant lay- ers to compress large language models

Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. Streamlining redundant lay- ers to compress large language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[5] [5]

A simple linear patch revives layer-pruned large language models

Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, and Chun Yuan. A simple linear patch revives layer-pruned large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026

[6] [6]

Prune&comp: Free lunch for layer-pruned LLMs via iterative pruning with magnitude compensation.arXiv preprint arXiv:2507.18212, 2025

Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, and Chun Yuan. Prune&comp: Free lunch for layer-pruned LLMs via iterative pruning with magnitude compensation.arXiv preprint arXiv:2507.18212, 2025

work page arXiv 2025

[7]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

[8]

Think you have solved question answering? Try ARC, the AI2 reasoning challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

[9]

The PASCAL recognising textual entailment challenge

Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer, 2005

[10]

SparseGPT: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023

[11]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

[12]

Matrix Computations

Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013

[13]

The Llama 3 herd of models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[14]

The unreasonable ineffectiveness of the deeper layers

Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Dan Roberts. The unreasonable ineffectiveness of the deeper layers. In The Thirteenth International Conference on Learning Representations, 2025

[15]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645:633–638, 2025

[16]

Accuracy and Stability of Numerical Algorithms

Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, second edition, 2002

[17]

Shortened LLaMA: Depth pruning for large language models with comparison of retraining methods

Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened LLaMA: Depth pruning for large language models with comparison of retraining methods. arXiv preprint arXiv:2402.02834, 2024

[18]

RACE: Large-scale ReAding comprehension dataset from examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, 2017

[19]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019

[20]

LLM-Pruner: On the structural pruning of large language models

Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-Pruner: On the structural pruning of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[21]

Building a large annotated corpus of English: The Penn Treebank

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993

[22]

ShortGPT: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20192–20204, July 2025

[23]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017

[24]

Can a suit of armor conduct electricity? A new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

[25]

Compact language models via pruning and knowledge distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. In Advances in Neural Information Processing Systems, volume 37, pages 41076–41102, 2024

[26]

GPT-4 technical report

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[27]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1), 2020

[28]

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011

[29]

WinoGrande: An adversarial Winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019

[30]

ReplaceMe: Network simplification via depth pruning and transformer block linearization

Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, and Sergey Zagoruyko. ReplaceMe: Network simplification via depth pruning and transformer block linearization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

[31]

SLEB: Streamlining LLMs through redundancy verification and elimination of transformer blocks

Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, and Jae-Joon Kim. SLEB: Streamlining LLMs through redundancy verification and elimination of transformer blocks. In Proceedings of the 41st International Conference on Machine Learning, 2024

[32]

A simple and effective pruning approach for large language models

Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024

[33]

LLaMA: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[34]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

[35]

Smith, and Hannaneh Hajishirzi

Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Jiacheng...

[36]

Sheared LLaMA: Accelerating language model pre-training via structured pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023

[37]

Qwen3 technical report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[38]

HellaSwag: Can a machine really finish your sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Appendix: Ghosted Layers

This appendix provides supplementary materials that complement the main paper. It includes the proof of ou...

We adopt 32 sequences as the default since this is the smallest size at which downstream accuracy is already saturated, and the additional perplexity reduction from larger calibration sets does not translate into accuracy gains.

D Fine-tuning results

D.1 Fine-tuning setup

We follow the fine-tuning protocol of LinearPatch [5] exactly to ensure a fair head-t...