arxiv: 2602.01997 · v2 · submitted 2026-02-02 · 💻 cs.LG · cs.AI

Recognition: no theorem link

On the Limits of Layer Pruning for Generative Reasoning in Large Language Models

Safal Shrestha , Anubhav Shrestha , Aadim Nepal , Minwu Kim , Keith Ross

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords layer pruninglarge language modelsgenerative reasoningmodel compressionarithmetic capabilitiesfinetuning recoveryGSM8Kalgorithmic capabilities

0 comments

The pith

Layer pruning removes algorithmic capabilities from LLMs that supervised finetuning on hundreds of billions of tokens fails to restore.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether layer pruning, known to compress LLMs with little loss on classification, can be used for generative reasoning tasks. The authors show that pruning eliminates core skills such as arithmetic computation and balanced parenthesis generation. A recovery approach using supervised finetuning on self-generated responses restores most classification performance but leaves reasoning tasks far below baseline. Even after training on roughly 400 billion tokens, performance on simple arithmetic and GSM8K remains limited. The work maps the practical boundaries of depth reduction when only standard post-training resources are available.

Core claim

Pruning layers from large language models leads to a loss of algorithmic capabilities, including arithmetic computation and balanced parenthesis generation. Supervised finetuning with self-generated responses recovers up to 90 percent of baseline performance on classification tasks but fails to restore generative reasoning performance even after finetuning on approximately 400 billion tokens. This limitation appears on simple tasks that do not require multi-step generation, indicating that the capabilities are not easily restored under realistic post-training constraints without pretraining-scale data or compute.

What carries the argument

Layer pruning followed by supervised finetuning on self-generated responses as the test of recoverability for lost algorithmic capabilities.

If this is right

Classification tasks tolerate layer pruning with high recovery under minimal finetuning.
Generative reasoning tasks exhibit persistent performance gaps that finetuning does not close.
Basic arithmetic accuracy drops after pruning and remains low even after large-scale finetuning.
Depth reduction works under constrained post-training regimes mainly for non-reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Specific layers appear to encode irreplaceable routines for basic computation that post-training cannot rebuild.
Pruning strategies may need to be combined with pretraining adjustments to preserve reasoning.
Similar tests on other minimal generative tasks could identify which skills are most vulnerable to layer removal.

Load-bearing premise

Supervised finetuning on self-generated responses under realistic post-training constraints is a sufficient test of whether algorithmic capabilities lost to pruning can be recovered.

What would settle it

Full recovery of original arithmetic accuracy and GSM8K scores after supervised finetuning on 400 billion tokens would show the capabilities are restorable.

Figures

Figures reproduced from arXiv: 2602.01997 by Aadim Nepal, Anubhav Shrestha, Keith Ross, Minwu Kim, Safal Shrestha.

**Figure 2.** Figure 2: Effect of removing a single layer on model performance across generative bench [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Text degeneration under singlelayer pruning, measured using 4-gram repetition (left) and Self-BLEU4 averaged across responses and normalized relative to the baseline. Text degeneration (Holtzman et al., 2019) is a commonly observed failure mode in pruned language models and can hinder instruction following and coherent generation. We quantify degeneration using two complementary metrics computed with 4… view at source ↗

**Figure 4.** Figure 4: Analysis of Model Performance and Error Types under Single-Layer Pruning. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of model performance across arithmetic and coding benchmarks [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Average accuracy on generative tasks for Qwen and LLaMA across pruning strategies and ratios, alongside model throughput. Throughput (tokens/s) is shown on the secondary axis (dotted line). Even at moderate pruning ratio(25%), a substantial gap remains in generative vs classification tasks. Additional details are provided in Appendix A.9. While prior sections highlight the limitations of aggressive prun… view at source ↗

**Figure 7.** Figure 7: Text degeneration results with layer pruning using N-gram repetition (left) and [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Effect of single-layer pruning on the arithmetic ability of various models. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Full Finetuning with GSM8K dataset on the Qwen Iterative Pruned Model. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Comparison between full supervised finetuning (Full-FT) and QLoRA on self [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Perplexity curves during training for both standard finetuning and for SGR for [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Differences between finetuning with Self-Generated Responses (SGR) vs on Dolci [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

read the original abstract

Recent work has shown that layer pruning can effectively compress large language models (LLMs) while retaining strong performance on classification benchmarks, often with little or no finetuning. In contrast, generative reasoning tasks, such as GSM8K and HumanEval\textsuperscript{+}, exhibit substantially weaker recovery. We show that beyond surface-level text degradation, pruning leads to a loss of key algorithmic capabilities, including arithmetic computation and balanced parenthesis generation. Under realistic post-training constraints, without access to pretraining-scale data or compute, we evaluate a minimal recovery strategy based on supervised finetuning with self-generated responses. This approach recovers up to 90\% of baseline performance on classification tasks, but recovery for generative reasoning remains fundamentally limited. Notably, even models finetuned on $\sim$400B tokens after pruning fail to recover their original reasoning performance, suggesting that such capabilities are not as easily restored. This limitation persists even on simple tasks such as arithmetic, which do not require multi-step generation. Overall, we characterize the practical limits of layer pruning for generative reasoning and provide guidance on when depth reduction is effective under constrained post-training regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Layer pruning tanks reasoning recovery even after 400B self-finetuning tokens, but the noisy self-generated labels leave open whether the skills are truly lost or just badly taught.

read the letter

Pruning layers in LLMs keeps classification performance mostly intact but damages generative reasoning in a way that resists recovery. The paper shows this gap clearly on GSM8K, HumanEval, arithmetic, and parenthesis tasks, where even hundreds of billions of tokens of self-finetuning only produce limited gains. That contrast between task types and the focus on lost algorithmic primitives is the main new piece here, and it goes beyond the usual pruning papers that stop at classification benchmarks. The scale of the recovery run under realistic post-training limits is also a solid data point worth having on record. Classification recovers to about 90 percent while reasoning stays stuck, which matches the claim that depth reduction hits step-by-step generation harder. The data and methods descriptions in the abstract look straightforward for an empirical study, with no obvious circularity or invented metrics. The soft spot is the recovery setup. Self-generated responses from an already-weak model are likely full of errors on arithmetic and reasoning examples, so supervised training on those targets cannot cleanly test whether the underlying capabilities are gone or simply not learnable from bad labels. No control with ground-truth data is mentioned, which weakens the stronger version of the irrecoverability conclusion. This is a real but fixable issue rather than a fatal one. The work is aimed at people building or compressing reasoning models for deployment. A reader who needs practical guidance on when pruning is safe would get direct value from the numbers, even if the interpretation needs tightening in revision. It deserves peer review so the methods and data splits can be checked in detail.

Referee Report

2 major / 0 minor

Summary. The paper claims that layer pruning effectively compresses LLMs for classification tasks with minimal finetuning but causes substantial loss of generative reasoning capabilities (e.g., arithmetic computation and balanced parenthesis generation) on tasks like GSM8K and HumanEval+. Under realistic post-training constraints, supervised finetuning on self-generated responses recovers up to 90% of baseline performance on classification but fails to restore original reasoning performance even after training on ~400B tokens, indicating that such algorithmic capabilities are not easily restored.

Significance. If the central empirical findings hold after addressing experimental controls, the work would usefully characterize the practical limits of depth reduction for preserving reasoning in LLMs, providing guidance on when pruning is viable under constrained post-training regimes and highlighting that simple tasks like arithmetic remain affected.

major comments (2)

[Abstract] Abstract (recovery strategy description): the headline result that ~400B-token SFT fails to recover arithmetic and generative reasoning depends on using self-generated responses from the already-pruned model as supervision. Without a control arm that uses ground-truth labels or teacher-generated correct responses, the observed non-recovery cannot distinguish permanent algorithmic loss from the effects of training on predominantly incorrect targets; this is load-bearing for the irrecoverability claim.
[Abstract] Abstract (experimental outcomes paragraph): the reported 90% recovery on classification versus limited recovery on reasoning is presented at a high level without specific pruning ratios, model architectures, data splits, or error bars, leaving the magnitude and robustness of the differential effect difficult to evaluate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity on the experimental design and results.

read point-by-point responses

Referee: [Abstract] Abstract (recovery strategy description): the headline result that ~400B-token SFT fails to recover arithmetic and generative reasoning depends on using self-generated responses from the already-pruned model as supervision. Without a control arm that uses ground-truth labels or teacher-generated correct responses, the observed non-recovery cannot distinguish permanent algorithmic loss from the effects of training on predominantly incorrect targets; this is load-bearing for the irrecoverability claim.

Authors: We agree this is a substantive point for interpreting the irrecoverability result. Our choice to use self-generated responses from the pruned model was intentional to evaluate recovery under realistic post-training constraints (no access to ground-truth labels or external teachers at scale). We acknowledge that this setup leaves open the possibility that poor supervision contributes to limited recovery. In the revised manuscript we expand the discussion section to explicitly note this potential confound and clarify that the headline claim is specifically about recovery difficulty under noisy self-supervision. The abstract has been updated to describe the recovery strategy more precisely. revision: partial
Referee: [Abstract] Abstract (experimental outcomes paragraph): the reported 90% recovery on classification versus limited recovery on reasoning is presented at a high level without specific pruning ratios, model architectures, data splits, or error bars, leaving the magnitude and robustness of the differential effect difficult to evaluate.

Authors: The abstract is intentionally concise; full experimental details (pruning ratios of 25-50% layers removed, Llama-2 7B/13B and Mistral-7B architectures, standard data splits for GSM8K and HumanEval+, and error bars from 3-5 runs) appear in Sections 3 and 4. To address the concern we have revised the abstract to include the main pruning ratio used and a note that all reported percentages include standard deviations from multiple runs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements only

full rationale

This is a purely empirical study reporting benchmark performance before/after layer pruning and after supervised finetuning on self-generated data. No equations, parameter fittings, or derivations appear anywhere in the manuscript. Claims rest on direct accuracy numbers (e.g., recovery percentages on GSM8K, arithmetic, HumanEval) rather than any self-referential definition, fitted-input-as-prediction, or load-bearing self-citation of a uniqueness theorem. The choice to use self-generated responses for SFT is an experimental design decision whose validity can be evaluated externally; it does not reduce the reported results to the inputs by construction. Therefore the derivation chain contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable beyond standard assumptions of LLM training and evaluation.

pith-pipeline@v0.9.0 · 5512 in / 1072 out tokens · 24381 ms · 2026-05-16T08:37:20.362229+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 14 internal anchors

[1]

arXiv: 2401.15024

Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns.arXiv preprint arXiv:2401.15024,

work page arXiv
[2]

Streamlining redundant layers to compress large language models.arXiv preprint arXiv:2403.19135,

Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. Streamlining redundant layers to compress large language models.arXiv preprint arXiv:2403.19135,

work page arXiv
[3]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Documenting large webtext corpora: A case study on the colossal clean crawled corpus

Jesse Dodge, Maarten Sap, Ana Marasovi´c, William Agnew, Gabriel Ilharco, Dirk Groen- eveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. InProceedings of the 2021 conference on empirical methods in natural language processing, pp. 1286–1305,

work page 2021
[6]

Olmo 3

Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Sparsegpt: Massive language models can be accurately pruned in one-shot.arXiv preprint arXiv:2301.00774,

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot.arXiv preprint arXiv:2301.00774,

work page arXiv
[8]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

The unreasonable ineffectiveness of the deeper layers.arXiv preprint arXiv:2403.17887,

Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A Roberts. The unreasonable ineffectiveness of the deeper layers.arXiv preprint arXiv:2403.17887,

work page arXiv
[11]

Measuring Massive Multitask Language Understanding

10 Preprint. Under review. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[12]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[14]

Auxiliary task demands mask the capabilities of smaller language models.arXiv preprint arXiv:2404.02418,

Jennifer Hu and Michael C Frank. Auxiliary task demands mask the capabilities of smaller language models.arXiv preprint arXiv:2404.02418,

work page arXiv
[15]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth´ee Lacroix, and William El Sayed. Mistral 7b.arXiv preprint arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv
[16]

arXiv: 2402.02834

Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened llama: Depth pruning for large language models with comparison of retraining methods.arXiv preprint arXiv:2402.02834,

work page arXiv
[17]

The remarkable robustness of llms: Stages of inference?arXiv preprint arXiv:2406.19384,

Vedang Lad, Jin Hwa Lee, Wes Gurnee, and Max Tegmark. The remarkable robustness of llms: Stages of inference?arXiv preprint arXiv:2406.19384,

work page arXiv
[18]

Reassessing layer pruning in llms: New insights and methods

Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, and Zhaowei Zhu. Reassessing layer pruning in llms: New insights and methods. arXiv preprint arXiv:2411.15558,

work page arXiv
[19]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 20192–20204,

work page 2025
[20]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering.arXiv preprint arXiv:1809.02789,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization.arXiv preprint arXiv:1808.08745,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Layer importance for mathematical reasoning is forged in pre-training and invariant after post-training.arXiv preprint arXiv:2506.22638,

Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Jalal Naghiyev, Ravid Shwartz-Ziv, and Keith Ross. Layer importance for mathematical reasoning is forged in pre-training and invariant after post-training.arXiv preprint arXiv:2506.22638,

work page arXiv
[23]

Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks.arXiv preprint arXiv:2402.09025,

Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, and Jae-Joon Kim. Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks.arXiv preprint arXiv:2402.09025,

work page arXiv
[24]

Demystifying the roles of llm layers in retrieval, knowledge, and reasoning.arXiv preprint arXiv:2510.02091,

Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, and Shiwei Liu. Demystifying the roles of llm layers in retrieval, knowledge, and reasoning.arXiv preprint arXiv:2510.02091,

work page arXiv
[25]

Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796,

work page arXiv
[26]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models.arXiv preprint arXiv:2306.11695,

work page internal anchor Pith review arXiv
[27]

The curse of depth in large language models.arXiv preprint arXiv:2502.05795,

Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models.arXiv preprint arXiv:2502.05795,

work page arXiv
[28]

URL https://crfm.stanford.edu/2023/03/13/alpaca. html. Stanford Center for Research on Foundation Models (CRFM). Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Nish Sinnadurai, and Sean Lie. Self-data distillation for recovering quality in pruned large language models.Proceedings of Machine Learning and Systems, 7,

work page 2023
[29]

Efficient large language models: A survey.arXiv preprint arXiv:2312.03863,

Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey.arXiv preprint arXiv:2312.03863,

work page arXiv
[30]

arXiv: 2510.22228

Keyu Wang, Tian Lyu, Guinan Su, Jonas Geiping, Lu Yin, Marco Canini, and Shiwei Liu. When fewer layers break more chains: Layer pruning harms test-time scaling in llms. arXiv preprint arXiv:2510.22228,

work page arXiv
[31]

arXiv:2310.06694 [cs]

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning.arXiv preprint arXiv:2310.06694,

work page arXiv
[32]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. 12 Preprint. Under review. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang...

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity.arXiv preprint arXiv:2310.05175, 2023

Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, et al. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity.arXiv preprint arXiv:2310.05175,

work page arXiv
[34]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?arXiv preprint arXiv:1905.07830,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[35]

A Appendix A.1 Additional Alignment in Minitron Sreenivas et al. (2024) report final results after multiple post-distillation alignment stages, including supervised finetuning on math and code data, instruction tuning, and preference optimization. These stages are applied after the distillation phase. The publicly released Llama-3.1-Minitron-4B-Depth chec...

work page 2024
[36]

We focus on Dolci since our broader objective is to assess whether post-pruning training can preserve generative reasoning performance

( ∼ 90K), 1 epoch. We focus on Dolci since our broader objective is to assess whether post-pruning training can preserve generative reasoning performance. We rely on QLoRA because it is comparable to even LoRA trained models in our experiments (see Table 1). We also show in A.6.3 that QLoRA closely matches performance of Full finetuning for recovery in ou...

work page 2025
[37]

Algorithm 1Greedy Iterative Pruning via Benchmark Performance Input:ModelMwithLlayers, benchmark datasetD, number of layers to pruneN Output:Pruned layer setP P ←∅ fork=1 toNdo ℓ⋆ ←arg max ℓ∈{1,...,L}\P Score M−(P ∪{ℓ}),D P ← P ∪ {ℓ ⋆} end for ReturnP A.6.2 Upper-Bound Recovery on GSM8K To estimate an upper bound on recoverable performance under layer pru...

work page 2024