pith. machine review for the scientific record. sign in

arxiv: 2602.01997 · v2 · submitted 2026-02-02 · 💻 cs.LG · cs.AI

Recognition: no theorem link

On the Limits of Layer Pruning for Generative Reasoning in Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords layer pruninglarge language modelsgenerative reasoningmodel compressionarithmetic capabilitiesfinetuning recoveryGSM8Kalgorithmic capabilities
0
0 comments X

The pith

Layer pruning removes algorithmic capabilities from LLMs that supervised finetuning on hundreds of billions of tokens fails to restore.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether layer pruning, known to compress LLMs with little loss on classification, can be used for generative reasoning tasks. The authors show that pruning eliminates core skills such as arithmetic computation and balanced parenthesis generation. A recovery approach using supervised finetuning on self-generated responses restores most classification performance but leaves reasoning tasks far below baseline. Even after training on roughly 400 billion tokens, performance on simple arithmetic and GSM8K remains limited. The work maps the practical boundaries of depth reduction when only standard post-training resources are available.

Core claim

Pruning layers from large language models leads to a loss of algorithmic capabilities, including arithmetic computation and balanced parenthesis generation. Supervised finetuning with self-generated responses recovers up to 90 percent of baseline performance on classification tasks but fails to restore generative reasoning performance even after finetuning on approximately 400 billion tokens. This limitation appears on simple tasks that do not require multi-step generation, indicating that the capabilities are not easily restored under realistic post-training constraints without pretraining-scale data or compute.

What carries the argument

Layer pruning followed by supervised finetuning on self-generated responses as the test of recoverability for lost algorithmic capabilities.

If this is right

  • Classification tasks tolerate layer pruning with high recovery under minimal finetuning.
  • Generative reasoning tasks exhibit persistent performance gaps that finetuning does not close.
  • Basic arithmetic accuracy drops after pruning and remains low even after large-scale finetuning.
  • Depth reduction works under constrained post-training regimes mainly for non-reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Specific layers appear to encode irreplaceable routines for basic computation that post-training cannot rebuild.
  • Pruning strategies may need to be combined with pretraining adjustments to preserve reasoning.
  • Similar tests on other minimal generative tasks could identify which skills are most vulnerable to layer removal.

Load-bearing premise

Supervised finetuning on self-generated responses under realistic post-training constraints is a sufficient test of whether algorithmic capabilities lost to pruning can be recovered.

What would settle it

Full recovery of original arithmetic accuracy and GSM8K scores after supervised finetuning on 400 billion tokens would show the capabilities are restorable.

Figures

Figures reproduced from arXiv: 2602.01997 by Aadim Nepal, Anubhav Shrestha, Keith Ross, Minwu Kim, Safal Shrestha.

Figure 1
Figure 1. Figure 1: Comparison of performance retention (normalized to baseline) between SFT with [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Effect of removing a single layer on model performance across generative bench [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Text degeneration under single￾layer pruning, measured using 4-gram rep￾etition (left) and Self-BLEU4 averaged across responses and normalized relative to the base￾line. Text degeneration (Holtzman et al., 2019) is a commonly observed failure mode in pruned language models and can hinder instruction following and coherent genera￾tion. We quantify degeneration using two complementary metrics computed with 4… view at source ↗
Figure 4
Figure 4. Figure 4: Analysis of Model Performance and Error Types under Single-Layer Pruning. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of model performance across arithmetic and coding benchmarks [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Average accuracy on generative tasks for Qwen and LLaMA across prun￾ing strategies and ratios, alongside model throughput. Throughput (tokens/s) is shown on the secondary axis (dotted line). Even at moderate pruning ratio(25%), a substantial gap remains in generative vs classification tasks. Additional details are provided in Ap￾pendix A.9. While prior sections highlight the limi￾tations of aggressive prun… view at source ↗
Figure 7
Figure 7. Figure 7: Text degeneration results with layer pruning using N-gram repetition (left) and [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Effect of single-layer pruning on the arithmetic ability of various models. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Full Finetuning with GSM8K dataset on the Qwen Iterative Pruned Model. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison between full supervised finetuning (Full-FT) and QLoRA on self [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Perplexity curves during training for both standard finetuning and for SGR for [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Differences between finetuning with Self-Generated Responses (SGR) vs on Dolci [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
read the original abstract

Recent work has shown that layer pruning can effectively compress large language models (LLMs) while retaining strong performance on classification benchmarks, often with little or no finetuning. In contrast, generative reasoning tasks, such as GSM8K and HumanEval\textsuperscript{+}, exhibit substantially weaker recovery. We show that beyond surface-level text degradation, pruning leads to a loss of key algorithmic capabilities, including arithmetic computation and balanced parenthesis generation. Under realistic post-training constraints, without access to pretraining-scale data or compute, we evaluate a minimal recovery strategy based on supervised finetuning with self-generated responses. This approach recovers up to 90\% of baseline performance on classification tasks, but recovery for generative reasoning remains fundamentally limited. Notably, even models finetuned on $\sim$400B tokens after pruning fail to recover their original reasoning performance, suggesting that such capabilities are not as easily restored. This limitation persists even on simple tasks such as arithmetic, which do not require multi-step generation. Overall, we characterize the practical limits of layer pruning for generative reasoning and provide guidance on when depth reduction is effective under constrained post-training regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that layer pruning effectively compresses LLMs for classification tasks with minimal finetuning but causes substantial loss of generative reasoning capabilities (e.g., arithmetic computation and balanced parenthesis generation) on tasks like GSM8K and HumanEval+. Under realistic post-training constraints, supervised finetuning on self-generated responses recovers up to 90% of baseline performance on classification but fails to restore original reasoning performance even after training on ~400B tokens, indicating that such algorithmic capabilities are not easily restored.

Significance. If the central empirical findings hold after addressing experimental controls, the work would usefully characterize the practical limits of depth reduction for preserving reasoning in LLMs, providing guidance on when pruning is viable under constrained post-training regimes and highlighting that simple tasks like arithmetic remain affected.

major comments (2)
  1. [Abstract] Abstract (recovery strategy description): the headline result that ~400B-token SFT fails to recover arithmetic and generative reasoning depends on using self-generated responses from the already-pruned model as supervision. Without a control arm that uses ground-truth labels or teacher-generated correct responses, the observed non-recovery cannot distinguish permanent algorithmic loss from the effects of training on predominantly incorrect targets; this is load-bearing for the irrecoverability claim.
  2. [Abstract] Abstract (experimental outcomes paragraph): the reported 90% recovery on classification versus limited recovery on reasoning is presented at a high level without specific pruning ratios, model architectures, data splits, or error bars, leaving the magnitude and robustness of the differential effect difficult to evaluate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity on the experimental design and results.

read point-by-point responses
  1. Referee: [Abstract] Abstract (recovery strategy description): the headline result that ~400B-token SFT fails to recover arithmetic and generative reasoning depends on using self-generated responses from the already-pruned model as supervision. Without a control arm that uses ground-truth labels or teacher-generated correct responses, the observed non-recovery cannot distinguish permanent algorithmic loss from the effects of training on predominantly incorrect targets; this is load-bearing for the irrecoverability claim.

    Authors: We agree this is a substantive point for interpreting the irrecoverability result. Our choice to use self-generated responses from the pruned model was intentional to evaluate recovery under realistic post-training constraints (no access to ground-truth labels or external teachers at scale). We acknowledge that this setup leaves open the possibility that poor supervision contributes to limited recovery. In the revised manuscript we expand the discussion section to explicitly note this potential confound and clarify that the headline claim is specifically about recovery difficulty under noisy self-supervision. The abstract has been updated to describe the recovery strategy more precisely. revision: partial

  2. Referee: [Abstract] Abstract (experimental outcomes paragraph): the reported 90% recovery on classification versus limited recovery on reasoning is presented at a high level without specific pruning ratios, model architectures, data splits, or error bars, leaving the magnitude and robustness of the differential effect difficult to evaluate.

    Authors: The abstract is intentionally concise; full experimental details (pruning ratios of 25-50% layers removed, Llama-2 7B/13B and Mistral-7B architectures, standard data splits for GSM8K and HumanEval+, and error bars from 3-5 runs) appear in Sections 3 and 4. To address the concern we have revised the abstract to include the main pruning ratio used and a note that all reported percentages include standard deviations from multiple runs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements only

full rationale

This is a purely empirical study reporting benchmark performance before/after layer pruning and after supervised finetuning on self-generated data. No equations, parameter fittings, or derivations appear anywhere in the manuscript. Claims rest on direct accuracy numbers (e.g., recovery percentages on GSM8K, arithmetic, HumanEval) rather than any self-referential definition, fitted-input-as-prediction, or load-bearing self-citation of a uniqueness theorem. The choice to use self-generated responses for SFT is an experimental design decision whose validity can be evaluated externally; it does not reduce the reported results to the inputs by construction. Therefore the derivation chain contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable beyond standard assumptions of LLM training and evaluation.

pith-pipeline@v0.9.0 · 5512 in / 1072 out tokens · 24381 ms · 2026-05-16T08:37:20.362229+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 14 internal anchors

  1. [1]

    arXiv: 2401.15024

    Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns.arXiv preprint arXiv:2401.15024,

  2. [2]

    Streamlining redundant layers to compress large language models.arXiv preprint arXiv:2403.19135,

    Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. Streamlining redundant layers to compress large language models.arXiv preprint arXiv:2403.19135,

  3. [3]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457,

  4. [4]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  5. [5]

    Documenting large webtext corpora: A case study on the colossal clean crawled corpus

    Jesse Dodge, Maarten Sap, Ana Marasovi´c, William Agnew, Gabriel Ilharco, Dirk Groen- eveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. InProceedings of the 2021 conference on empirical methods in natural language processing, pp. 1286–1305,

  6. [6]

    Olmo 3

    Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961,

  7. [7]

    Sparsegpt: Massive language models can be accurately pruned in one-shot.arXiv preprint arXiv:2301.00774,

    Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot.arXiv preprint arXiv:2301.00774,

  8. [8]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,

  9. [9]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  10. [10]

    The unreasonable ineffectiveness of the deeper layers.arXiv preprint arXiv:2403.17887,

    Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A Roberts. The unreasonable ineffectiveness of the deeper layers.arXiv preprint arXiv:2403.17887,

  11. [11]

    Measuring Massive Multitask Language Understanding

    10 Preprint. Under review. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

  12. [12]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

  13. [13]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751,

  14. [14]

    Auxiliary task demands mask the capabilities of smaller language models.arXiv preprint arXiv:2404.02418,

    Jennifer Hu and Michael C Frank. Auxiliary task demands mask the capabilities of smaller language models.arXiv preprint arXiv:2404.02418,

  15. [15]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth´ee Lacroix, and William El Sayed. Mistral 7b.arXiv preprint arXiv:...

  16. [16]

    arXiv: 2402.02834

    Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened llama: Depth pruning for large language models with comparison of retraining methods.arXiv preprint arXiv:2402.02834,

  17. [17]

    The remarkable robustness of llms: Stages of inference?arXiv preprint arXiv:2406.19384,

    Vedang Lad, Jin Hwa Lee, Wes Gurnee, and Max Tegmark. The remarkable robustness of llms: Stages of inference?arXiv preprint arXiv:2406.19384,

  18. [18]

    Reassessing layer pruning in llms: New insights and methods

    Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, and Zhaowei Zhu. Reassessing layer pruning in llms: New insights and methods. arXiv preprint arXiv:2411.15558,

  19. [19]

    Shortgpt: Layers in large language models are more redundant than you expect

    Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 20192–20204,

  20. [20]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering.arXiv preprint arXiv:1809.02789,

  21. [21]

    Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

    Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization.arXiv preprint arXiv:1808.08745,

  22. [22]

    Layer importance for mathematical reasoning is forged in pre-training and invariant after post-training.arXiv preprint arXiv:2506.22638,

    Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Jalal Naghiyev, Ravid Shwartz-Ziv, and Keith Ross. Layer importance for mathematical reasoning is forged in pre-training and invariant after post-training.arXiv preprint arXiv:2506.22638,

  23. [23]

    Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks.arXiv preprint arXiv:2402.09025,

    Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, and Jae-Joon Kim. Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks.arXiv preprint arXiv:2402.09025,

  24. [24]

    Demystifying the roles of llm layers in retrieval, knowledge, and reasoning.arXiv preprint arXiv:2510.02091,

    Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, and Shiwei Liu. Demystifying the roles of llm layers in retrieval, knowledge, and reasoning.arXiv preprint arXiv:2510.02091,

  25. [25]

    Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

    Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796,

  26. [26]

    A Simple and Effective Pruning Approach for Large Language Models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models.arXiv preprint arXiv:2306.11695,

  27. [27]

    The curse of depth in large language models.arXiv preprint arXiv:2502.05795,

    Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models.arXiv preprint arXiv:2502.05795,

  28. [28]

    URL https://crfm.stanford.edu/2023/03/13/alpaca. html. Stanford Center for Research on Foundation Models (CRFM). Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Nish Sinnadurai, and Sean Lie. Self-data distillation for recovering quality in pruned large language models.Proceedings of Machine Learning and Systems, 7,

  29. [29]

    Efficient large language models: A survey.arXiv preprint arXiv:2312.03863,

    Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey.arXiv preprint arXiv:2312.03863,

  30. [30]

    arXiv: 2510.22228

    Keyu Wang, Tian Lyu, Guinan Su, Jonas Geiping, Lu Yin, Marco Canini, and Shiwei Liu. When fewer layers break more chains: Layer pruning harms test-time scaling in llms. arXiv preprint arXiv:2510.22228,

  31. [31]

    arXiv:2310.06694 [cs]

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning.arXiv preprint arXiv:2310.06694,

  32. [32]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. 12 Preprint. Under review. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang...

  33. [33]

    Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity.arXiv preprint arXiv:2310.05175, 2023

    Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, et al. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity.arXiv preprint arXiv:2310.05175,

  34. [34]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?arXiv preprint arXiv:1905.07830,

  35. [35]

    A Appendix A.1 Additional Alignment in Minitron Sreenivas et al. (2024) report final results after multiple post-distillation alignment stages, including supervised finetuning on math and code data, instruction tuning, and preference optimization. These stages are applied after the distillation phase. The publicly released Llama-3.1-Minitron-4B-Depth chec...

  36. [36]

    We focus on Dolci since our broader objective is to assess whether post-pruning training can preserve generative reasoning performance

    ( ∼ 90K), 1 epoch. We focus on Dolci since our broader objective is to assess whether post-pruning training can preserve generative reasoning performance. We rely on QLoRA because it is comparable to even LoRA trained models in our experiments (see Table 1). We also show in A.6.3 that QLoRA closely matches performance of Full finetuning for recovery in ou...

  37. [37]

    Algorithm 1Greedy Iterative Pruning via Benchmark Performance Input:ModelMwithLlayers, benchmark datasetD, number of layers to pruneN Output:Pruned layer setP P ←∅ fork=1 toNdo ℓ⋆ ←arg max ℓ∈{1,...,L}\P Score M−(P ∪{ℓ}),D P ← P ∪ {ℓ ⋆} end for ReturnP A.6.2 Upper-Bound Recovery on GSM8K To estimate an upper bound on recoverable performance under layer pru...