Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs
Pith reviewed 2026-05-19 16:10 UTC · model grok-4.3
The pith
A closed-form linear operator derived from calibration data can reconstruct the hidden-state mismatch caused by removing entire layers from large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. Ghosted Layers address this by solving a boundary activation alignment problem. The method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. This solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces.
What carries the argument
The closed-form optimal linear operator for boundary activation alignment, obtained by solving the least-squares problem on a calibration set to minimize the difference between pruned and original activations.
If this is right
- The method yields higher accuracy and lower perplexity than prior training-free recovery techniques on multiple LLM families and pruning strategies.
- The efficiency gains from layer pruning, such as reduced inference latency and memory use, remain intact because the added operator is a single matrix multiplication.
- Because the solution is the true unconstrained optimum rather than an approximation inside a restricted subspace, further improvements would require changing the objective itself rather than searching harder within the same family.
- The approach is training-free and uses only a small calibration set, so it can be applied after any pruning decision without additional optimization.
Where Pith is reading between the lines
- If the discrepancy introduced by pruning turns out to be largely linear, similar closed-form operators might correct other compression artifacts such as those from low-rank adaptation or early-exit mechanisms.
- The calibration-set requirement suggests that periodically refitting the operator on recent user data could keep recovery quality high when the input distribution shifts over time.
- Because the operator is derived once and then fixed, it could be fused into the adjacent layers at deployment time to eliminate any extra runtime cost beyond the original pruning savings.
Load-bearing premise
The activation discrepancy caused by removed layers can be accurately captured and reversed by one linear transformation that was fitted on limited calibration examples and then works for every input the model will see later.
What would settle it
If the linear operator fitted on the calibration set produces no measurable reduction in activation mismatch or no gain in perplexity and accuracy when tested on a large, held-out set of diverse inputs, the claim that it provides the effective unconstrained recovery would be falsified.
Figures
read the original abstract
Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that layer pruning in LLMs creates a boundary activation mismatch that can be recovered training-free by Ghosted Layers: a closed-form optimal linear operator fitted on a small calibration set to reconstruct the pruned-layer activation discrepancy. It asserts this operator is the unconstrained optimum of the alignment objective (unlike prior methods limited to constrained subspaces), and reports consistent accuracy and perplexity gains over baselines across multiple LLM backbones and pruning strategies.
Significance. If the linear operator generalizes beyond the calibration distribution and the closed-form derivation is independent of downstream task loss, the approach would offer a lightweight, training-free way to mitigate pruning-induced degradation while retaining the efficiency benefits of layer removal. The parameter-free character of the claimed optimum would be a notable strength for reproducibility.
major comments (3)
- [Abstract / boundary activation alignment problem] Abstract and boundary activation alignment paragraph: the claim that a single linear operator fitted on a small calibration set reconstructs the discrepancy and remains effective across the full inference distribution lacks any robustness argument or bound on distribution shift; subsequent nonlinear Transformer layers can alter the required mapping, and no explicit test of this assumption is supplied.
- [Abstract] Abstract: the statement that the solution 'corresponds to the unconstrained optimum' is presented without derivation details, equations, or a proof that the operator is independent of the downstream task loss; without these, it is unclear whether the closed-form reduces to an empirical fit on the calibration data rather than a true unconstrained optimum.
- [Experiments] Experiments section: no error bars, standard deviations, or details on calibration-set selection and size are provided, making it impossible to assess whether the reported consistent gains are statistically reliable or sensitive to the choice of calibration data.
minor comments (2)
- The manuscript should include a clear statement of the exact least-squares objective and the resulting closed-form expression for the linear operator (presumably W = Y X^+ or equivalent) so readers can verify the unconstrained claim.
- Figure and table captions would benefit from explicit mention of the calibration-set size and the pruning ratios tested to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We have addressed each major comment below and revised the paper accordingly to improve clarity, rigor, and reproducibility.
read point-by-point responses
-
Referee: [Abstract / boundary activation alignment problem] Abstract and boundary activation alignment paragraph: the claim that a single linear operator fitted on a small calibration set reconstructs the discrepancy and remains effective across the full inference distribution lacks any robustness argument or bound on distribution shift; subsequent nonlinear Transformer layers can alter the required mapping, and no explicit test of this assumption is supplied.
Authors: We agree that a formal robustness bound or theoretical analysis of distribution shift would strengthen the claims. The linear operator is derived to minimize the immediate boundary mismatch on the calibration set, and while subsequent nonlinear layers can in principle modify the effective mapping, the alignment is applied precisely at the interface to reduce propagation of the discrepancy. Our experiments already test generalization across multiple models, pruning ratios, and evaluation datasets that differ from the calibration distribution. In the revision we have added a dedicated paragraph in Section 3.2 discussing the modeling assumptions and limitations, and we include new experiments evaluating performance on out-of-distribution prompts to provide more explicit empirical support for the assumption. revision: partial
-
Referee: [Abstract] Abstract: the statement that the solution 'corresponds to the unconstrained optimum' is presented without derivation details, equations, or a proof that the operator is independent of the downstream task loss; without these, it is unclear whether the closed-form reduces to an empirical fit on the calibration data rather than a true unconstrained optimum.
Authors: We appreciate this observation. The full derivation appears in Section 3, where we formulate the alignment objective as an unconstrained least-squares problem over the linear operator and obtain the closed-form solution via the normal equations; this solution depends only on the observed activation pairs from the calibration set and contains no dependence on any downstream task loss. To address the referee's concern we have expanded the abstract to include a brief reference to the derivation and added a pointer to the relevant equations (Eqs. 3–6) so that readers can immediately locate the proof that the operator is the unconstrained optimum. revision: yes
-
Referee: [Experiments] Experiments section: no error bars, standard deviations, or details on calibration-set selection and size are provided, making it impossible to assess whether the reported consistent gains are statistically reliable or sensitive to the choice of calibration data.
Authors: We thank the referee for noting this gap in reporting. The revised Experiments section now reports mean performance together with standard deviations and error bars computed over five independent random seeds for every metric and model. We have also added a new paragraph detailing the calibration-set construction: for each experiment we randomly sample 256 sequences (each of length 512 tokens) from the training split of the respective dataset, with the random seed fixed for reproducibility; sensitivity to calibration-set size is additionally explored in an appendix table. revision: yes
Circularity Check
No significant circularity; derivation is a standard closed-form solution to an explicitly stated alignment objective
full rationale
The paper defines a boundary activation alignment objective and derives its unconstrained optimum as a closed-form linear operator fitted on calibration activations. This is a direct mathematical solution to the stated minimization problem rather than a reduction of the claimed result to its own inputs by construction. No self-citations are invoked as load-bearing premises, no uniqueness theorems are imported from prior author work, and no fitted parameter is relabeled as an independent prediction. The central claim remains that the derived operator is unconstrained (in contrast to prior constrained subspaces), which follows from the problem formulation itself without tautology. Generalization from calibration to inference is an empirical assumption but does not render the derivation chain circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The activation discrepancy at layer boundaries after pruning is reconstructible by a linear operator derived from a small calibration set.
Reference graph
Works this paper leans on
-
[1]
Fluctuation-based adaptive structured pruning for large language models
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based adaptive structured pruning for large language models. InProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 2024
work page 2024
-
[2]
Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman
Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[3]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott...
work page 1901
-
[4]
Streamlining redundant lay- ers to compress large language models
Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. Streamlining redundant lay- ers to compress large language models. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[5]
A simple linear patch revives layer-pruned large language models
Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, and Chun Yuan. A simple linear patch revives layer-pruned large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026
work page 2026
-
[6]
Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, and Chun Yuan. Prune&comp: Free lunch for layer-pruned LLMs via iterative pruning with magnitude compensation.arXiv preprint arXiv:2507.18212, 2025
-
[7]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...
work page 2019
-
[8]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[9]
The PASCAL recognising textual entailment challenge
Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer, 2005
work page 2005
-
[10]
Sparsegpt: Massive language models can be accurately pruned in one-shot
Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot.arXiv preprint arXiv:2301.00774, 2023
-
[11]
The language model evaluation harness, 07 2024
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...
work page 2024
-
[12]
Gene H. Golub and Charles F. Van Loan.Matrix Computations. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013. 10
work page 2013
-
[13]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The LLaMA 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
The unreasonable ineffectiveness of the deeper layers
Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Dan Roberts. The unreasonable ineffectiveness of the deeper layers. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[15]
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645:633–638, 2025
Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645:633–638, 2025
work page 2025
-
[16]
Higham.Accuracy and Stability of Numerical Algorithms
Nicholas J. Higham.Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, second edition, 2002
work page 2002
-
[17]
arXiv preprint arXiv:2402.02834 , volume=
Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened LLaMA: Depth pruning for large language models with comparison of retraining methods.arXiv preprint arXiv:2402.02834, 2024
-
[18]
RACE: Large-scale ReAding compre- hension dataset from examinations
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding compre- hension dataset from examinations. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, 2017
work page 2017
-
[19]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR), 2019
work page 2019
-
[20]
LLM-pruner: On the structural pruning of large language models
Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-pruner: On the structural pruning of large language models. InThirty-seventh Conference on Neural Information Processing Systems, 2023
work page 2023
-
[21]
Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank.Computational Linguistics, 19(2):313–330, 1993
work page 1993
-
[22]
ShortGPT: Layers in large language models are more redundant than you expect
Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect. InFindings of the Association for Computational Linguistics: ACL 2025, pages 20192–20204, July 2025
work page 2025
-
[23]
Pointer sentinel mixture models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017
work page 2017
-
[24]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018
work page 2018
-
[25]
Compact language models via pruning and knowledge distillation
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. InAdvances in Neural Information Processing Systems, volume 37, pages 41076– 41102, 2024
work page 2024
-
[26]
OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(1), 2020
work page 2020
-
[28]
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. InAAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011
work page 2011
-
[29]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.arXiv preprint arXiv:1907.10641, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[30]
Replaceme: Network simplification via depth pruning and transformer block linearization
Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, and Sergey Zagoruyko. Replaceme: Network simplification via depth pruning and transformer block linearization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026
work page 2026
-
[31]
Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks
Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, and Jae-Joon Kim. Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks. InProceedings of the 41st International Conference on Machine Learning, 2024
work page 2024
-
[32]
A simple and effective pruning approach for large language models
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. InThe Twelfth International Conference on Learning Representations, 2024. 11
work page 2024
-
[33]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Smith, and Hannaneh Hajishirzi
Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Jiacheng...
work page 2025
-
[36]
arXiv preprint arXiv:2310.06694 (2023)
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning.arXiv preprint arXiv:2310.06694, 2023
-
[37]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. 12 Appendix: Ghosted Layers This appendix provides supplementary materials that complement the main paper. It includes the proof of ou...
work page 2019
-
[39]
We adopt 32 sequences as the default since this is the smallest size at which downstream accuracy is already saturated, and the additional perplexity reduction from larger calibration sets does not translate into accuracy gains. D Fine-tuning results D.1 Fine-tuning setup We follow the fine-tuning protocol ofLinearPatch [5] exactly to ensure a fair head-t...
work page 2086
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.