pith. machine review for the scientific record.

arxiv: 2603.20991 · v2 · submitted 2026-03-22 · 💻 cs.LG · cs.AI · cs.CL · cs.LO

Recognition: 2 theorem links · Lean Theorem

Structural Sensitivity in Compressed Transformers: Relative Error Propagation and Layer Removal

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.LO
keywords transformer compression · error propagation · layer pruning · sparsity scheduling · activation-aware pruning · representation drift · depth pruning

The pith

Errors introduced when compressing a transformer layer are multiplied by a measurable factor rho at every later layer, so early compression produces larger final drift than late compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines rho as the ratio of output error to input error at each transformer layer and shows that the product of rho values from any layer onward accurately forecasts how much the model's final representations will drift under compression. This product explains the observed greater damage from pruning early layers and the advantage of depth-decreasing sparsity schedules over uniform ones. Within individual layers, naive pruning exhibits extreme sensitivity variation that activation-aware methods shrink dramatically, while ranking layers by how far rho sits from one supplies a two-pass criterion for layer removal that beats prior heuristics.

Core claim

Across six models from 117M to 8B parameters, the product of per-layer rho values predicts representation drift with Spearman correlation -0.44. This contraction profile supplies a training-free rule for choosing which weights to prune inside a layer and which entire layers to delete, delivering lower perplexity than ShortGPT at eight layers removed and a 1.22x wall-clock speedup from physical deletion.

What carries the argument

rho, the ratio of a layer's output error norm to its input error norm, whose running product from layer t onward predicts the scale of downstream representation change.
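For concreteness, a minimal formal sketch of the quantity just described. The symbols e_t (the error entering layer t), L (the number of layers), and the choice of norm are notational assumptions of this sketch; the paper's exact definitions are not reproduced on this page.

```latex
% Per-layer propagation factor: output error over input error at layer t.
\rho_t = \frac{\lVert e_{t+1} \rVert}{\lVert e_t \rVert}

% Predicted scale of final-representation drift when compression injects
% an error e_t at layer t of an L-layer model (indexing convention assumed):
\lVert e_L \rVert \approx \lVert e_t \rVert \prod_{s=t}^{L-1} \rho_s
```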

If this is right

  • Compressing an early layer changes final representations more than compressing a late layer of equal size.
  • Sparsity schedules that remove more from later layers outperform uniform sparsity at the same average compression rate.
  • Ranking layers for removal by how far each rho sits from one requires two forward passes and yields 1.6x lower perplexity than Block Influence at eight layers deleted; a minimal sketch of this two-pass ranking follows this list.
  • Activation-aware pruning reduces the 600-fold spread in per-component sensitivity to 3-7x and reverses importance rankings across architectures.
  • A blend of rho ranking and secondary criteria reaches perplexity 14.2 and 60 percent downstream accuracy on LLaMA-2-7B after eight layers removed.
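The two-pass ranking mentioned in the third bullet can be sketched as follows. This is not the paper's reference implementation: the hook-based capture, the use of a perturbed input batch as the error source, and the choice of the L2 norm are assumptions made for illustration. Per the rebuttal further down the page, layers whose rho sits farthest from one are removed first.

```python
import torch


@torch.no_grad()
def rank_layers_by_rho(model, layers, clean_batch, perturbed_batch):
    """Rank transformer blocks for removal by |rho - 1| using two forward passes.

    `layers` is assumed to be an ordered list of the model's blocks; the
    hidden state entering each block is captured with a forward pre-hook.
    The perturbed batch stands in for whatever reference error the paper
    injects (an assumption of this sketch). The last block's rho is omitted
    because its output is not captured here.
    """
    captured = {}

    def make_hook(idx, tag):
        def hook(module, inputs):
            captured[(idx, tag)] = inputs[0].detach()
        return hook

    for tag, batch in (("clean", clean_batch), ("pert", perturbed_batch)):
        handles = [layer.register_forward_pre_hook(make_hook(i, tag))
                   for i, layer in enumerate(layers)]
        model(batch)  # one forward pass per condition: two passes total
        for h in handles:
            h.remove()

    rhos = []
    for i in range(len(layers) - 1):
        err_in = (captured[(i, "pert")] - captured[(i, "clean")]).norm()
        err_out = (captured[(i + 1, "pert")] - captured[(i + 1, "clean")]).norm()
        rhos.append((err_out / err_in).item())

    # Rank by distance of rho from one; farthest-from-one layers first,
    # following the ranking described in the simulated rebuttal below.
    order = sorted(range(len(rhos)), key=lambda i: abs(rhos[i] - 1.0), reverse=True)
    return rhos, order
```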

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Width and internal redundancy appear to set an overall tolerance for rho variation, so wider models may tolerate more aggressive early compression than narrow ones.
  • Iterating the rho measurement after each pruning round could further reduce drift by adapting to the changed distribution.
  • The same product-of-ratios logic may apply to pruning in non-transformer sequence models whose layers also act as successive operators.

Load-bearing premise

The rho values computed on the original uncompressed model remain a reliable predictor of error growth once weights have been pruned or layers removed.

What would settle it

After applying rho-guided pruning or layer removal, recompute the actual representation drift on the compressed model and test whether it still equals the product of the original rho values.
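A minimal sketch of that check, under stated assumptions: the caller supplies the per-layer rho values measured on the original model, the error norms actually injected by the applied compression, and the drift measured on the compressed model. Treating multiple injected errors as approximately additive is an extra simplification of this sketch, as is the indexing convention for the product.

```python
import numpy as np


def post_compression_check(rho_original, injected_errors, observed_drift):
    """Compare rho-predicted drift against drift measured after compression.

    rho_original    : per-layer rho values from the uncompressed model
    injected_errors : error norm introduced at each layer by compression
                      (zero for layers left untouched)
    observed_drift  : final-representation drift measured on the
                      compressed model
    """
    num_layers = len(rho_original)
    predicted = 0.0
    for t, err in enumerate(injected_errors):
        if err > 0.0:
            # Scale each injected error by the product of later rho values.
            predicted += err * float(np.prod(rho_original[t:num_layers]))
    # A single ratio is shown here; across many compression configurations,
    # a rank correlation would mirror the Spearman test reported above.
    return predicted, observed_drift / max(predicted, 1e-12)
```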

Figures

Figures reproduced from arXiv: 2603.20991 by Abhinaba Basu, Koushik Deb, Kumkum Basu.

Figure 1. Overview. (a) Three composable approximation stages, each with machine…
Figure 2. Structural sensitivity analysis. (a) Per-group compression regret on log scale…
Figure 3. The early-layer MLP catastrophe. (a) Layer 0 component decomposition across…
Figure 4. Cumulative compression: forward (L0→L11) vs. backward (L11→L0). The 25,716× gap at one layer compressed demonstrates that layer 0 dominates the sensitivity landscape.
Figure 5. Contraction decomposition. (a) GPT-2 Small: blue bars show hidden state…
Figure 6. Per-layer contraction factor ρt across five architectures on a normalized depth axis. GPT-2 Small (blue) contracts everywhere (ρ < 1). LFM2 (green), Mistral (orange), and Qwen3 (red) all cross the stability boundary, with GPT-2 Medium (purple) spiking to ρ = 2.05 at the final layer.
Figure 7. Greedy sequential compression waterfall. Each bar adds one compression group…
read the original abstract

Compressing transformer weights makes large language models cheaper to deploy. But each layer's compression introduces an error. These errors accumulate as the signal passes through later layers, and how they accumulate is not well understood. We measure this directly: at each layer, we take the ratio of output to input error, calling it rho. A value below one means the layer absorbs the error; above one means it grows. Computing rho on six transformers (117M to 8B parameters) yields three findings. (i) Errors at layer t scale downstream by the product of later rho values, predicting representation drift (Spearman r = -0.44, p < 10^-4). This explains why compressing early layers hurts more than late ones, and why depth-decreasing sparsity schedules outperform uniform ones. Across architecture families, however, model width and redundancy matter more than rho alone. (ii) Within a layer, naive pruning shows a ~600x spread in component sensitivity. Activation-aware pruning (Wanda) shrinks this to 3-7x; the ranking reverses across architectures, so fixed importance scores do not transfer. (iii) For depth pruning, ranking layers by how far rho is from one takes two forward passes. It beats ShortGPT's Block Influence with 1.6x lower perplexity at eight layers removed, and physical deletion delivers 1.22x wall-clock speed-up. A blend of the two criteria does best (perplexity 14.2, 60.0% downstream accuracy on LLaMA-2-7B). Twelve Lean 4 norm inequalities provide machine-checked per-matrix error bounds. The contraction profile thus gives a training-free instrument for two decisions: where to compress within layers, and which to remove.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces rho as the per-layer relative error propagation factor, defined as the ratio of output to input error in transformer forward passes. It claims that downstream errors scale by the product of subsequent rho values, which correlates with representation drift (Spearman r = -0.44, p < 10^{-4}) across models from 117M to 8B parameters. This is used to explain why early-layer compression is more detrimental and why depth-decreasing sparsity outperforms uniform. The work also analyzes intra-layer pruning sensitivity (showing Wanda reduces spread to 3-7x), proposes rho-distance ranking for layer removal that beats ShortGPT (1.6x lower perplexity at 8 layers removed), and provides 12 machine-checked Lean 4 norm inequalities for error bounds.

Significance. Should the rho predictor prove robust to compression-induced distribution shifts, the results would provide a practical, training-free framework for deciding compression locations within and across layers. The machine-verified bounds and concrete empirical correlations (including 1.22x speed-up from physical deletion) add rigor and applicability to model compression research in large language models.

major comments (2)
  1. [Finding (i)] The central predictive claim in finding (i)—that errors at layer t scale downstream by the product of later rho values and predict representation drift—relies on rho measured from the uncompressed model remaining valid after pruning or layer removal. The reported Spearman correlation appears derived from original-model forward passes; a direct post-compression validation (predicted product vs. observed drift after weight changes) is needed, as non-linear layers and distribution shifts may alter effective propagation.
  2. [Depth pruning experiments] The depth-pruning results claim rho-based ranking (distance from 1) outperforms ShortGPT with 1.6x lower perplexity at eight layers removed. However, without explicit quantification of the ranking metric or confirmation that the original rho product correlates with actual post-removal drift, the mechanistic grounding for early-layer sensitivity and schedule superiority is not fully load-bearing.
minor comments (2)
  1. [Abstract] The abstract states 'Twelve Lean 4 norm inequalities provide machine-checked per-matrix error bounds' but gives no indication of which specific norms (e.g., operator vs. Frobenius) or matrices are bounded; a short reference to the relevant theorem or section would aid clarity.
  2. [Experimental setup] Reproducibility details such as exact error-norm definitions for rho, data splits used for drift measurements, and whether error bars reflect multiple runs are not specified in the provided text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key opportunities to strengthen the validation of rho under compression. We agree that direct post-compression checks would improve the manuscript and will add them. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: The central predictive claim in finding (i)—that errors at layer t scale downstream by the product of later rho values and predict representation drift—relies on rho measured from the uncompressed model remaining valid after pruning or layer removal. The reported Spearman correlation appears derived from original-model forward passes; a direct post-compression validation (predicted product vs. observed drift after weight changes) is needed, as non-linear layers and distribution shifts may alter effective propagation.

    Authors: We clarify that the reported Spearman correlation (r = -0.44) is computed using rho values from the uncompressed models, correlating the product of subsequent rho values with representation drift measured in the original forward passes. This establishes the error scaling relationship in the base setting. In the compression experiments, we apply the original rho values as a ranking heuristic for pruning and layer removal decisions, with the resulting performance gains (lower perplexity and higher downstream accuracy) serving as empirical support for its practical utility. We acknowledge that a direct post-compression validation—comparing the predicted product from original rho against observed drift after weight changes—would provide stronger mechanistic evidence under distribution shifts. We will add this analysis to the revised manuscript. revision: yes

  2. Referee: The depth-pruning results claim rho-based ranking (distance from 1) outperforms ShortGPT with 1.6x lower perplexity at eight layers removed. However, without explicit quantification of the ranking metric or confirmation that the original rho product correlates with actual post-removal drift, the mechanistic grounding for early-layer sensitivity and schedule superiority is not fully load-bearing.

    Authors: The ranking metric is quantified explicitly as the absolute distance |rho - 1| per layer (detailed in Section 4.3), with layers farthest from 1 removed first. This yields the reported 1.6x lower perplexity at eight layers removed on LLaMA-2-7B compared to ShortGPT. The mechanistic grounding rests on the error propagation findings, where deviation from rho = 1 indicates amplification or absorption. We agree that confirming the correlation between the original rho product for removed layers and post-removal observed drift would strengthen the explanation. We will include this comparison in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; rho measurements and Lean bounds are independent

full rationale

The paper defines rho directly as the measured ratio of output to input error from forward passes on the uncompressed model. The downstream scaling claim follows as a direct multiplicative consequence of these per-layer ratios under the observed propagation, and is then validated by an independent Spearman correlation (r = -0.44) against separately measured representation drift. This correlation is an empirical test, not a quantity forced by construction or by fitting rho to the drift target. The twelve Lean 4 norm inequalities supply machine-checked formal bounds on per-matrix errors that do not depend on the empirical rho values, the correlation, or any post-compression data. No self-citations appear as load-bearing premises, no uniqueness theorems are imported from prior author work, and no ansatz or known result is renamed as a new derivation. The central claims therefore rest on direct measurement plus independent formal verification rather than reducing to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on the definition of rho as a multiplicative factor and on standard matrix-norm inequalities for error bounds; no free parameters or new physical entities are introduced.

axioms (1)
  • standard math · Matrix norm inequalities bound per-layer error propagation
    Twelve Lean 4 norm inequalities provide machine-checked per-matrix error bounds.
invented entities (1)
  • rho (relative error propagation factor) · no independent evidence
    purpose: Quantify per-layer amplification of compression error
    Defined directly from forward-pass measurements; no independent falsifiable prediction outside the reported experiments.
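For concreteness, inequalities of the kind the single axiom above points to look like the following; whether the paper's twelve Lean 4 statements take exactly this form is not confirmed by the text on this page.

```latex
% Standard per-matrix perturbation bound via operator-norm
% submultiplicativity (one plausible member of the family of
% machine-checked norm inequalities referenced above):
\lVert (W + \Delta W)\,x - W x \rVert_2
  = \lVert \Delta W\, x \rVert_2
  \le \lVert \Delta W \rVert_2 \, \lVert x \rVert_2,
\qquad
\lVert A B \rVert_2 \le \lVert A \rVert_2 \, \lVert B \rVert_2 .
```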

pith-pipeline@v0.9.0 · 5636 in / 1257 out tokens · 41012 ms · 2026-05-15T06:25:34.559249+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Coquelicot: A user-friendly library of real analysis for Coq

    Boldo, S., Lelay, C., Melquiond, G., 2015. Coquelicot: A user-friendly library of real analysis for Coq. Mathematics in Computer Science 9, 41–62

  2. [2]

    Certified adversarial robustness via randomized smoothing, in: ICML

    Cohen, J., Rosenfeld, E., Kolter, Z., 2019. Certified adversarial robustness via randomized smoothing, in: ICML

  3. [3]

    LLM.int8(): 8-bit matrix multiplication for transformers at scale, in: NeurIPS

    Dettmers, T., Lewis, M., Belkada, Y., Zettlemoyer, L., 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale, in: NeurIPS

  4. [4]

    HAWQ: Hessian aware quantization of neural networks with mixed-precision, in: ICCV

    Dong, Z., Yao, Z., Gholami, A., Mahoney, M.W., Keutzer, K., 2019. HAWQ: Hessian aware quantization of neural networks with mixed-precision, in: ICCV

  5. [5]

    SparseGPT: Massive language models can be accurately pruned in one-shot, in: ICML

    Frantar, E., Alistarh, D., 2023. SparseGPT: Massive language models can be accurately pruned in one-shot, in: ICML

  6. [6]

    GPTQ: Accurate post-training quantization for generative pre-trained transformers, in: ICLR

    Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D., 2023. GPTQ: Accurate post-training quantization for generative pre-trained transformers, in: ICLR

  7. [7]

    A framework for few-shot language model evaluation

    Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., Zou, A., 2024. A framework for few-shot language model evaluation. URL:https://zen...

  8. [8]

    All optimal Hankel-norm approximations of linear multivariable systems and their L∞-error bounds

    Glover, K., 1984. All optimal Hankel-norm approximations of linear multivariable systems and their L∞-error bounds. International Journal of Control 39, 1115–1193

  9. [9]

    Stable architectures for deep neural networks

    Haber, E., Ruthotto, L., 2018. Stable architectures for deep neural networks. Inverse Problems 34, 014004

  10. [10]

    Language model compression with weighted low-rank factorization, in: ICLR

    Hsu, Y.C., Hua, T., Chang, S.E., Lou, Q., Shen, Y., Jin, H., 2022. Language model compression with weighted low-rank factorization, in: ICLR

  11. [11]

    Reluplex: An efficient SMT solver for verifying deep neural networks, in: CAV

    Katz, G., Barrett, C., Dill, D.L., Julian, K., Kochenderfer, M.J., 2017. Reluplex: An efficient SMT solver for verifying deep neural networks, in: CAV

  12. [12]

    Nonlinear Systems

    Khalil, H.K., 2002. Nonlinear Systems. 3rd ed., Prentice Hall

  13. [13]

    Learned token pruning for transformers, in: KDD

    Kim, S., Shen, S., Thorsley, D., Gholami, A., Kwon, W., Hassoun, J., Keutzer, K., 2022. Learned token pruning for transformers, in: KDD

  14. [14]

    Fast inference from transformers via speculative decoding, in: ICML

    Leviathan, Y., Kalman, M., Matias, Y., 2023. Fast inference from transformers via speculative decoding, in: ICML

  15. [15]

    Pointer sentinel mixture models, in: ICLR

    Merity, S., Xiong, C., Bradbury, J., Socher, R., 2017. Pointer sentinel mixture models, in: ICLR

  16. [16]

    Principal component analysis in linear systems: Controllability, observability, and model reduction

    Moore, B.C., 1981. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control 26, 17–32

  17. [17]

    The Lean 4 theorem prover and programming language, in: CADE

    de Moura, L., Ullrich, S., 2021. The Lean 4 theorem prover and programming language, in: CADE

  18. [18]

    Efficiently scaling transformer inference, in: MLSys

    Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., Dean, J., 2023. Efficiently scaling transformer inference, in: MLSys

  19. [19]

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.,

  20. [20]

    Language models are unsupervised multitask learners

    Language models are unsupervised multitask learners. OpenAI Blog

  21. [21]

    Exploring the limits of transfer learn- ing with a unified text-to-text transformer

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., 2020. Exploring the limits of transfer learn- ing with a unified text-to-text transformer. Journal of Machine Learning Research 21, 1–67

  22. [22]

    Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., Smith, N.A.,

  23. [23]

    The right tool for the job: Matching model and instance complexities, in: ACL

  24. [24]

    Developing bug-free machine learning systems with formal mathematics, in: ICML

    Selsam, D., Liang, P., Dill, D.L., 2017. Developing bug-free machine learning systems with formal mathematics, in: ICML

  25. [25]

    Fast Transformer Decoding: One Write-Head is All You Need

    Shazeer, N., 2019. Fast transformer decoding: One write-head is all you need. arXiv:1911.02150

  26. [26]

    A simple and effective pruning approach for large language models, in: ICLR

    Sun, M., Liu, Z., Bair, A., Kolter, J.Z., 2024. A simple and effective pruning approach for large language models, in: ICLR

  27. [27]

    Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., He, Y.,

  28. [28]

    ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers, in: NeurIPS

  29. [29]

    Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity, in: ICML

    Yin, L., Wu, Y., Zhang, Z., Hsieh, C.Y., Wang, Y., Jia, Y., Pechenizkiy, M., Liang, Y., Wang, Z., Liu, S., 2024. Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity, in: ICML