TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

Aman Chadha; Krish Sharma; Nicholas Asher; Omar Naim; Soumadeep Saha; Vinija Jain

arxiv: 2605.14738 · v2 · pith:I3A4BUP5new · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Krish Sharma, Omar Naim, Soumadeep Saha, Vinija Jain, Aman Chadha, Nicholas Asher

Pith reviewed 2026-05-22 10:06 UTC · model grok-4.3

keywords: task-aware pruning · out-of-distribution generalization · representational geometry · layer pruning · distribution shift · large language models · norm profiles

The pith

Task-aware pruning improves OOD accuracy by removing layers that distort task-adapted geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines task-aware layer pruning and finds it brings no gain on in-distribution data yet reliably raises accuracy on out-of-distribution inputs in both polynomial regression and large language models. OOD examples produce layerwise norm and pairwise-distance profiles that diverge from the profiles seen on the task-adapted distribution. Pruning targets the layers that generate or enlarge this mismatch, pulling OOD representations back toward the geometry observed on in-distribution inputs. A reader would care because the result supplies a concrete geometric account for why selective pruning can improve robustness to distribution shifts without retraining the whole model.
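
As an illustration of what "task-aware" selection can look like, the sketch below greedily removes whichever layer most improves a validation score and stops when no removal helps. This page does not spell out TALE's selection rule, so the greedy loop, the name `greedy_prune`, and the toy `toy_score` function are all assumptions made for illustration, not the authors' procedure.

```python
def greedy_prune(num_layers, score):
    """Greedily drop layers while the validation score strictly improves.

    score(kept) takes a list of kept layer indices and returns a number
    (higher is better), e.g. accuracy on held-out task data.
    """
    kept = list(range(num_layers))
    best = score(kept)
    while len(kept) > 1:
        # Score every single-layer removal from the current model.
        trials = [(score([i for i in kept if i != j]), j) for j in kept]
        trial_score, drop = max(trials)
        if trial_score <= best:
            break  # no removal helps any more
        best = trial_score
        kept.remove(drop)
    return kept, best


# Toy score (assumption): each useful layer adds 10 correct answers,
# while the single harmful layer (index 2) costs 20.
HARMFUL = {2}


def toy_score(kept):
    useful = sum(1 for i in kept if i not in HARMFUL)
    harmful = sum(1 for i in kept if i in HARMFUL)
    return 10 * useful - 20 * harmful


kept, best = greedy_prune(4, toy_score)
print(kept, best)  # [0, 1, 3] 30
```

In this toy setup only layer 2 hurts the score, so the loop removes it and then stops because every further removal lowers the score.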

Core claim

Task-aware pruning identifies layers that create or amplify distortion for OOD inputs; by removing them it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution and improves performance. Across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution data but consistently improves out-of-distribution accuracy. OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles, and residual-scaling interventions supply causal evidence for the realignment effect.

What carries the argument

Task-adapted geometry, characterized by the layerwise norm and pairwise-distance profiles measured on ID inputs, which OOD inputs distort and which task-aware pruning corrects by layer removal.
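
A minimal sketch of how such profiles might be computed from per-layer hidden states. The distance profile follows the Figure 3 description (median L2 distance from the final token to prior tokens); taking the median norm over tokens, and the names `norm_profile` and `distance_profile`, are assumptions for illustration.

```python
import math
from statistics import median


def norm_profile(hidden):
    """Median L2 norm over tokens, one value per layer.

    hidden[layer][token] is a hidden-state vector (list of floats).
    """
    return [median(math.hypot(*vec) for vec in layer) for layer in hidden]


def distance_profile(hidden):
    """Median L2 distance from the final token to each prior token, per layer."""
    profile = []
    for layer in hidden:
        last, prior = layer[-1], layer[:-1]
        profile.append(median(math.dist(last, vec) for vec in prior))
    return profile


# Toy activations (made up): 2 layers, 3 tokens, 2-dimensional hidden states.
hidden = [
    [[3.0, 4.0], [0.0, 0.0], [3.0, 0.0]],
    [[6.0, 8.0], [0.0, 0.0], [0.0, 5.0]],
]
print(norm_profile(hidden))
print(distance_profile(hidden))
```

Running the same two functions on ID and OOD batches gives the pair of curves whose gap the paper interprets as distortion.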

Load-bearing premise

Deviations in layerwise norm and pairwise-distance profiles for OOD inputs amount to a correctable distortion of a task-adapted geometry rather than unrelated variation.

What would settle it

A controlled shift experiment in which the layers selected by task-aware pruning are removed yet the OOD norm and distance profiles remain as far from the ID profiles as before, or the profiles move closer but OOD accuracy does not rise.

Figures

Figures reproduced from arXiv: 2605.14738 by Aman Chadha, Krish Sharma, Nicholas Asher, Omar Naim, Soumadeep Saha, Vinija Jain.

**Figure 2.** Threshold analyses over linear functions sampled from …

**Figure 3.** Pruning realigns OOD representations toward the in-distribution geometry. Top: regression-task results for L2 median distance from the final token to prior tokens. (a) The model is trained on U(−1, 1) and tested on U(1, 2): OOD distances inflate to ∼385, and TALE contracts them toward the ID trajectory. (b) With train/test roles reversed, pruning expands OOD distances toward the ID baseline, showing that T…

**Figure 4.** Distribution-dependence of the layer-3 linear surrogate

**Figure 5.** Accuracy on MMLU high-school mathematics, using 2-shot evaluation, under different …

**Figure 6.** Plot of L2 pair distances across GSM8K and Winogrande with Llama

**Figure 7.** L2 distances before and after TALE pruning on GSM8K and BigBench (both on Llama 3.1 8B) and on BoolQ (Lucie 7B). Appendix I, Linear-surrogate diagnostics: extended results. This appendix gives the per-cell histograms and layer analyses for the surrogate analysis in Section 4.3. One-sided expansion vs. two-sided refinement. Section 4.3 reported median norm gain only. The shape of the gain distribution turns out to carry more info…

**Figure 8.** Performance gain (∆ accuracy from baseline) as a function of residual scaling α. The intervention is consistent, on the other hand, with our geometrical view: this particular layer contributes a positive-on-average residual update on OOD inputs, and reducing the magnitude of that update at test time reduces the OOD geometric distortion the layer introduces. The monotone α-accuracy curve is what the mag…

**Figure 9.** Plots for Qwen on Winogrande data set with output through all layers.

**Figure 10.** Plots for Llama on Winogrande data set with output through all layers. The blue curve is …

**Figure 11.** Plots for Llama on Big Bench data set with output through all layers.

**Figure 12.** Plots for Lucie on MMLU data set with output through all layers.

**Figure 13.** Plots for Lucie on BoolQ data set with output through all layers.

**Figure 14.** Layerwise predictions on a 12-layer, 8-attention-head transformer trained on …

**Figure 15.** Small transformer trained on U(−1, 1) with OOD data set U(1, 2). Dashed lines are …

**Figure 16.** Small transformer trained on U(1, 2) with OOD data set U(−1, 1). Dashed lines are …

**Figure 17.** The L1 analogue of the L2 analysis in …

Original abstract

Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.
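
The residual-scaling intervention mentioned above can be pictured on a toy residual stack in which each layer's update is multiplied by a factor α before being added back: α = 1 leaves the layer untouched and α = 0 is equivalent to pruning it. The layers `shift` and `double` and the helper `forward` are hypothetical stand-ins, not the paper's models.

```python
import math


def forward(x, layers, alphas):
    """Run a residual stack h <- h + alpha * f(h); return every hidden state."""
    h = list(x)
    states = [h]
    for f, alpha in zip(layers, alphas):
        update = f(h)
        h = [hi + alpha * ui for hi, ui in zip(h, update)]
        states.append(h)
    return states


def shift(h):
    """Toy layer: adds a constant residual update."""
    return [1.0 for _ in h]


def double(h):
    """Toy layer: a large update that inflates the norm."""
    return [2.0 * v for v in h]


layers = [shift, double]
x = [1.0, 1.0]
for alpha in (1.0, 0.5, 0.0):
    # Scale only the second layer's residual update.
    final = forward(x, layers, [1.0, alpha])[-1]
    print(alpha, final, math.hypot(*final))
```

Lowering α on the second layer shrinks the final norm step by step, which is the kind of graded effect a residual-scaling sweep looks for.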

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Task-aware pruning improves OOD accuracy by removing layers that distort representation profiles. The paper offers a geometric explanation and some intervention evidence, but the causal mechanism is not fully locked down.

The main takeaway is that task-aware pruning boosts OOD accuracy by identifying and removing layers that create or amplify distortions in layerwise norms and pairwise distances for shifted inputs, moving those profiles closer to the ones seen on the task-adapted distribution. The authors report no ID gains but consistent OOD improvements on both polynomial regression tasks and LLMs, plus profile shifts after pruning. Controlled distribution shifts and residual-scaling interventions are offered as causal support.

This geometric framing using representation profiles is the new piece on top of earlier task-aware pruning results like TALE. The patterns hold across scales and the interventions move beyond pure correlation, which is a step forward. The work stays focused and does not overclaim ID benefits.

The softer spot is that the story treats OOD profile deviations as removable distortions of a task geometry rather than variation that happens to correlate with pruning gains. The interventions show that pruning affects both the profiles and performance, yet they do not test whether aligning the profiles by some other route, independent of pruning, would recover the accuracy lift. That leaves room for alternative accounts such as simply dropping high-sensitivity layers.

This paper is for people working on pruning methods or OOD robustness in neural networks. Readers who want mechanistic accounts of why pruning helps under distribution shift will find the framing and experiments worth their time. It has enough new explanation and empirical consistency to deserve a serious referee rather than a desk reject, mainly to pressure-test the causal link between the measured geometry and the performance change.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the mechanisms behind task-aware pruning's benefits for out-of-distribution (OOD) generalization. It demonstrates through polynomial regression tasks and large language models that task-aware pruning provides no improvement on in-distribution (ID) data but consistently enhances OOD performance. The authors attribute this to OOD inputs causing deviations in layerwise norm and pairwise-distance profiles from those observed on ID data, which represent a task-adapted geometry. Pruning removes layers that amplify these distortions, realigning OOD representations with the adapted geometry. Causal support is provided via controlled distribution shifts and residual-scaling interventions, with consistent results across model scales.

Significance. If validated, the geometric interpretation offers a principled explanation for why task-aware pruning aids OOD capability without harming ID performance. This could inform pruning strategies in deep learning, particularly for LLMs, by targeting layers based on representational distortion rather than heuristic importance scores. The use of controlled tasks and interventions adds rigor to the empirical findings.

major comments (2)

[Section 4 (Causal Evidence)] The residual-scaling and distribution shift interventions show that profile changes correlate with performance gains, but do not directly test whether forcing OOD profiles to match ID profiles (independent of pruning) would recover the OOD accuracy improvement. This leaves open the possibility that the benefits arise from discarding high-sensitivity layers rather than geometry realignment.
[Section 3 (Geometric Explanation)] The task-adapted geometry is defined empirically by ID profiles, and OOD deviations are labeled as distortion. However, without a quantitative measure or falsifiable prediction of how much deviation causes performance drop, the interpretation risks being post-hoc.

minor comments (2)

[Experimental Setup] The results would benefit from reporting error bars, multiple random seeds, and statistical tests to quantify the consistency of OOD improvements across runs.
[Notation] Clarify the exact computation of pairwise-distance profiles to ensure reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of our work's significance. We address each major comment below and are prepared to revise the manuscript to strengthen the causal evidence and add quantitative rigor to the geometric interpretation.

Referee: [Section 4 (Causal Evidence)] The residual-scaling and distribution shift interventions show that profile changes correlate with performance gains, but do not directly test whether forcing OOD profiles to match ID profiles (independent of pruning) would recover the OOD accuracy improvement. This leaves open the possibility that the benefits arise from discarding high-sensitivity layers rather than geometry realignment.

Authors: We thank the referee for highlighting this distinction. The residual-scaling intervention adjusts the layer contributions to shift OOD representation profiles toward ID profiles without any layer removal, and we observe corresponding OOD accuracy gains. This provides evidence that geometry realignment contributes to the benefit beyond simply discarding sensitive layers. We acknowledge that a more direct profile-forcing method (e.g., via representation editing) would offer stronger isolation of the mechanism. In revision we will expand the discussion of this limitation and add a clarifying experiment if feasible.

revision: partial

Referee: [Section 3 (Geometric Explanation)] The task-adapted geometry is defined empirically by ID profiles, and OOD deviations are labeled as distortion. However, without a quantitative measure or falsifiable prediction of how much deviation causes performance drop, the interpretation risks being post-hoc.

Authors: We agree that a quantitative distortion measure would strengthen the claim and reduce post-hoc risk. We will introduce a simple distortion metric based on the L2 deviation between OOD and ID layerwise norm and pairwise-distance profiles. In the revision we will demonstrate that this metric correlates with OOD performance drop across shifts and that task-aware pruning reduces the metric, yielding a falsifiable prediction: layers contributing most to distortion should be pruned for OOD gains. The existing multi-task consistency and intervention results already constrain the interpretation, but the added metric will make it more rigorous.

revision: yes
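
One way the proposed metric could look, assuming it is the Euclidean distance between an ID profile and an OOD profile taken layer by layer. The rebuttal does not give a formula, so `distortion` and `layer_gaps` are hypothetical names and the profile values below are made up.

```python
import math


def distortion(id_profile, ood_profile):
    """L2 deviation between two layerwise profiles of equal length."""
    if len(id_profile) != len(ood_profile):
        raise ValueError("profiles must cover the same layers")
    return math.dist(id_profile, ood_profile)


def layer_gaps(id_profile, ood_profile):
    """Squared gap at each layer; the largest entries mark candidate layers."""
    return [(o - i) ** 2 for i, o in zip(id_profile, ood_profile)]


# Made-up layerwise norm profiles for three layers.
id_norms = [1.0, 2.0, 3.0]
ood_norms = [1.0, 5.0, 7.0]

print(distortion(id_norms, ood_norms))  # 5.0
gaps = layer_gaps(id_norms, ood_norms)
print(gaps.index(max(gaps)))  # 2
```

The same function can be applied separately to the norm profile and the pairwise-distance profile; how the two scores would be combined is left open here.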

Circularity Check

0 steps flagged

No significant circularity; empirical observations and interventions form self-contained chain

The paper grounds its claims in direct empirical measurements: OOD inputs produce layerwise norm and pairwise-distance profiles that deviate from ID profiles, task-aware pruning yields no ID benefit but consistent OOD gains, and controlled interventions (distribution shifts, residual scaling) produce corresponding profile shifts and performance changes. The geometric interpretation is explicitly framed as a post-hoc explanation of these observed regularities rather than a deductive step that defines the target geometry in terms of the pruning outcome or vice versa. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the reference to TALE is presented as external prior work. Because the central result is a set of reproducible experimental patterns plus causal interventions that do not reduce to definitional identities, the chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The account depends on the empirical premise that ID profiles define a stable task-adapted geometry and that OOD deviations are distortions amenable to layer removal.

axioms (1)

domain assumption: OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles.
This deviation is presented as the observable signature of geometric distortion.

invented entities (1)

task-adapted geometry (no independent evidence)
purpose: Conceptual characterization of representation profiles observed on ID inputs
Serves as the reference state that pruning is claimed to restore for OOD inputs

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 16 internal anchors

[1]

Layer by Layer: Uncovering Hidden Representations in Language Models

Layer by layer: Uncovering hidden representations in language models , author=. arXiv preprint arXiv:2502.02013 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Hugging Face repository , howpublished =

Jia LI and Edward Beeching and Lewis Tunstall and Ben Lipkin and Roman Soletskyi and Shengyi Costa Huang and Kashif Rasul and Longhui Yu and Albert Jiang and Ziju Shen and Zihan Qin and Bin Dong and Li Zhou and Yann Fleureau and Guillaume Lample and Stanislas Polu , title =. Hugging Face repository , howpublished =. 2024 , publisher =

work page 2024
[3]

TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination

TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination , author=. arXiv preprint arXiv:2510.22767 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

, note =

HuggingFaceH4 , title =. , note =

work page
[5]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

work page 2021
[6]

2023 , eprint=

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel , author=. 2023 , eprint=

work page 2023
[7]

2025 , eprint=

BlockPruner: Fine-grained Pruning for Large Language Models , author=. 2025 , eprint=

work page 2025
[8]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , author=. arXiv preprint arXiv:2410.05229 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2603.12228 , year=

Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights , author=. arXiv preprint arXiv:2603.12228 , year=

work page arXiv
[10]

, author=

Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language. , author=. Advances in Neural Information Processing Systems , volume=

work page
[11]

2023 , eprint =

A Simple and Effective Pruning Approach for Large Language Models , author =. 2023 , eprint =

work page 2023
[12]

2021 , eprint=

Zero Time Waste: Recycling Predictions in Early Exit Neural Networks , author=. 2021 , eprint=

work page 2021
[13]

2024 , eprint=

RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference , author=. 2024 , eprint=

work page 2024
[14]

, note =

Bo-Kyeong Kim1* Geonmin Kim1* Tae-Ho Kim1† Thibault Castells1 Shinkook Choi1 Junho Shin1 Hyoung-Kyu Song , title =. , note =

work page
[15]

arXiv preprint arXiv:2506.21103 , year=

Learning to Skip the Middle Layers of Transformers , author=. arXiv preprint arXiv:2506.21103 , year=

work page arXiv
[16]

Transactions of the Association for Computational Linguistics , volume=

A Survey on Model Compression for Large Language Models , author=. Transactions of the Association for Computational Linguistics , volume=

work page
[17]

International conference on machine learning , pages=

Sparsegpt: Massive language models can be accurately pruned in one-shot , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023
[18]

German Conference on Artificial Intelligence (K

Re-examining learning linear functions in context , author=. German Conference on Artificial Intelligence (K. 2025 , organization=

work page 2025
[19]

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning , author=

work page
[20]

60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 , pages=

Structured Pruning Learns Compact and Accurate Models , author=. 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 , pages=. 2022 , organization=

work page 2022
[21]

arXiv preprint arXiv:1905.05950 , year=

BERT rediscovers the classical NLP pipeline , author=. arXiv preprint arXiv:1905.05950 , year=

work page arXiv 1905
[22]

arXiv preprint arXiv:2402.02834 , volume=

Shortened llama: A simple depth pruning for large language models , author=. arXiv preprint arXiv:2402.02834 , volume=

work page arXiv
[23]

The Twelfth International Conference on Learning Representations , year=

Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs , author=. The Twelfth International Conference on Learning Representations , year=

work page
[24]

arXiv preprint arXiv:2503.12294 , year=

The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation , author=. arXiv preprint arXiv:2503.12294 , year=

work page arXiv
[25]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

Layer-wise Model Pruning based on Mutual Information , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2021
[26]

Deep Variational Information Bottleneck

Deep variational information bottleneck , author=. arXiv preprint arXiv:1612.00410 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

The information bottleneck method

The information bottleneck method , author=. arXiv preprint physics/0004057 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

The Bell system technical journal , volume=

A mathematical theory of communication , author=. The Bell system technical journal , volume=. 1948 , publisher=

work page 1948
[29]

Fano , title =

Robert M. Fano , title =. 1961 , pages =

work page 1961
[30]

2015 ieee information theory workshop (itw) , pages=

Deep learning and the information bottleneck principle , author=. 2015 ieee information theory workshop (itw) , pages=. 2015 , organization=

work page 2015
[31]

Opening the Black Box of Deep Neural Networks via Information

Opening the black box of deep neural networks via information , author=. arXiv preprint arXiv:1703.00810 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

Layer-wise neuron pruning using mutual information , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

work page 2021
[33]

arXiv preprint arXiv:2411.00147 , year=

Mutual Information Preserving Pruning (MIPP) , author=. arXiv preprint arXiv:2411.00147 , year=

work page arXiv
[34]

arXiv preprint arXiv:2003.08472 , year=

MINT: Mutual Information-based Neuron Trimming for DNN Compression , author=. arXiv preprint arXiv:2003.08472 , year=

work page arXiv 2003
[35]

Computational Linguistics , volume=

Probing Classifiers: Promises, Shortcomings, and Advances , author=. Computational Linguistics , volume=

work page
[36]

International Conference on Machine Learning (ICML) , year=

Generalization bounds of information bottleneck for representation learning , author=. International Conference on Machine Learning (ICML) , year=

work page
[37]

Journal of Statistical Mechanics: Theory and Experiment , volume=

On the information bottleneck theory of deep learning , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2019 , publisher=

work page 2019
[38]

arXiv e-prints , pages=

MINE: mutual information neural estimation , author=. arXiv e-prints , pages=

work page
[39]

Advances in neural information processing systems , volume=

Optimal brain damage , author=. Advances in neural information processing systems , volume=

work page
[40]

IEEE international conference on neural networks , pages=

Optimal brain surgeon and general network pruning , author=. IEEE international conference on neural networks , pages=. 1993 , organization=

work page 1993
[41]

Advances in neural information processing systems , volume=

Learning both weights and connections for efficient neural network , author=. Advances in neural information processing systems , volume=

work page
[42]

Compression of Neural Machine Translation Models via Pruning

Compression of neural machine translation models via pruning , author=. arXiv preprint arXiv:1606.09274 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Advances in neural information processing systems , volume=

Evaluation beyond task performance: analyzing concepts in AlphaZero in Hex , author=. Advances in neural information processing systems , volume=

work page
[44]

arXiv preprint arXiv:2004.06499 , year=

What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models , author=. arXiv preprint arXiv:2004.06499 , year=

work page arXiv 2004
[45]

Computer Speech & Language , volume=

On the effect of dropping layers of pre-trained transformer models , author=. Computer Speech & Language , volume=. 2023 , publisher=

work page 2023
[46]

arXiv preprint arXiv:2004.04010 , year=

Analyzing redundancy in pretrained transformer models , author=. arXiv preprint arXiv:2004.04010 , year=

work page arXiv 2004
[47]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Fluctuation-based adaptive structured pruning for large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[48]

Proceedings of the 41st International Conference on Machine Learning , pages=

Outlier weighed layerwise sparsity (OWL) a missing secret sauce for pruning LLMs to high sparsity , author=. Proceedings of the 41st International Conference on Machine Learning , pages=

work page
[49]

From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models

From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models , author=. arXiv preprint arXiv:2510.18030 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Pattern Recognition Letters , volume=

Greedy-layer pruning: Speeding up transformer models for natural language processing , author=. Pattern Recognition Letters , volume=. 2022 , publisher=

work page 2022
[51]

OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition , author=

work page
[52]

arXiv preprint arXiv:2310.06694 , year=

Sheared llama: Accelerating language model pre-training via structured pruning , author=. arXiv preprint arXiv:2310.06694 , year=

work page arXiv
[53]

Exploring Sparsity in Recurrent Neural Networks

Exploring sparsity in recurrent neural networks , author=. arXiv preprint arXiv:1704.05119 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[54]

arXiv preprint arXiv:2502.07780 , year=

Darwinlm: Evolutionary structured pruning of large language models , author=. arXiv preprint arXiv:2502.07780 , year=

work page arXiv
[55]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

work page 2021
[56]

International Conference on Learning Representations (ICLR) , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations (ICLR) , year=

work page
[57]

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , volume=

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , volume=. 2019 , publisher=

work page 2019
[58]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

work page 2018
[59]

Communications of the ACM , volume=

WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. Communications of the ACM , volume=. 2021 , doi=

work page 2021
[60]

B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark and Kenton Lee and Ming-Wei Chang and Tom Kwiatkowski and Michael Collins and Kristina Toutanova , booktitle =. B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions. 2019 , address =. doi:10.18653/v1/N19-1300 , pages =

work page doi:10.18653/v1/n19-1300 2019
[61]

Transactions on Machine Learning Research , year =

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models , author=. Transactions on Machine Learning Research , year =

work page
[62]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

International conference on machine learning , pages=

Compressing neural networks with the hashing trick , author=. International conference on machine learning , pages=. 2015 , organization=

work page 2015
[64]

Data-free parameter pruning for Deep Neural Networks

Data-free parameter pruning for deep neural networks. arXiv 2015 , author=. arXiv preprint arXiv:1507.06149 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2015
[65]

Pruning Filters for Efficient ConvNets

Pruning filters for efficient convnets , author=. arXiv preprint arXiv:1608.08710 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[66]

Proceedings of the IEEE international conference on computer vision , pages=

Channel pruning for accelerating very deep neural networks , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page
[67]

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned , author=. arXiv preprint arXiv:1905.09418 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1905
[68]

arXiv preprint arXiv:2109.04838 , year=

Block pruning for faster transformers , author=. arXiv preprint arXiv:2109.04838 , year=

work page arXiv
[69]

Shortgpt: Layers in large language models are more redundant than you expect

Shortgpt: Layers in large language models are more redundant than you expect , author=. arXiv preprint arXiv:2403.03853 , year=

work page arXiv
[70]

2023 , note=

E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity , author=. 2023 , note=

work page 2023
[71]

2023 , note=

SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot , author=. 2023 , note=

work page 2023
[72]

2024 , note=

A Simple and Effective Pruning Approach for Large Language Models , author=. 2024 , note=

work page 2024
[73]

Advances in Neural Information Processing Systems , volume=

What can transformers learn in-context? a case study of simple function classes , author=. Advances in Neural Information Processing Systems , volume=

work page
[74] What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661.
[75] SLEB: Streamlining LLMs through redundancy verification and elimination of transformer blocks. arXiv preprint arXiv:2402.09025.
[76] SliceGPT: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024.
[77] A Simple and Effective Pruning Approach for Large Language Models. arXiv preprint arXiv:2306.11695.
[78] Entropy and mutual information in models of deep neural networks. Advances in Neural Information Processing Systems.
[79] Efficient Estimation of Mutual Information for Strongly Dependent Variables. arXiv preprint arXiv:1411.2003.
[80] Wen, Jinyong. 2024. doi:10.1145/3664647.3680682.
