pith. machine review for the scientific record.

arxiv: 2603.20991 · v2 · submitted 2026-03-22 · 💻 cs.LG · cs.AI · cs.CL · cs.LO

Recognition: 2 theorem links · Lean Theorem

Structural Sensitivity in Compressed Transformers: Relative Error Propagation and Layer Removal

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.LO
keywords transformer compression · error propagation · layer pruning · sparsity scheduling · activation-aware pruning · representation drift · depth pruning

The pith

Errors introduced when compressing a transformer layer are multiplied by a measurable factor rho at every later layer, so early compression produces larger final drift than late compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines rho as the ratio of output error to input error at each transformer layer and shows that the product of rho values from any layer onward accurately forecasts how much the model's final representations will drift under compression. This product explains the observed greater damage from pruning early layers and the advantage of depth-decreasing sparsity schedules over uniform ones. Within individual layers, naive pruning exhibits extreme sensitivity variation that activation-aware methods shrink dramatically, while ranking layers by how far rho sits from one supplies a two-pass criterion for layer removal that beats prior heuristics.

Core claim

Across six models from 117M to 8B parameters, the product of per-layer rho values predicts representation drift with Spearman correlation -0.44. This contraction profile supplies a training-free rule for choosing which weights to prune inside a layer and which entire layers to delete, delivering lower perplexity than ShortGPT at eight layers removed and a 1.22x wall-clock speedup from physical deletion.

What carries the argument

rho, the ratio of a layer's output error norm to its input error norm, whose running product from layer t onward predicts the scale of downstream representation change.
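For concreteness, a minimal formal sketch of the quantity just described. The symbols e_t (the error entering layer t), L (the number of layers), and the choice of norm are notational assumptions of this sketch; the paper's exact definitions are not reproduced on this page.

```latex
% Per-layer propagation factor: output error over input error at layer t.
\rho_t = \frac{\lVert e_{t+1} \rVert}{\lVert e_t \rVert}

% Predicted scale of final-representation drift when compression injects
% an error e_t at layer t of an L-layer model (indexing convention assumed):
\lVert e_L \rVert \approx \lVert e_t \rVert \prod_{s=t}^{L-1} \rho_s
```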

If this is right

  • Compressing an early layer changes final representations more than compressing a late layer of equal size.
  • Sparsity schedules that remove more from later layers outperform uniform sparsity at the same average compression rate.
  • Ranking layers for removal by how far each rho sits from one requires two forward passes and yields 1.6x lower perplexity than Block Influence at eight layers deleted; a minimal sketch of this two-pass ranking follows this list.
  • Activation-aware pruning reduces the 600-fold spread in per-component sensitivity to 3-7x and reverses importance rankings across architectures.
  • A blend of rho ranking and secondary criteria reaches perplexity 14.2 and 60 percent downstream accuracy on LLaMA-2-7B after eight layers removed.
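The two-pass ranking mentioned in the third bullet can be sketched as follows. This is not the paper's reference implementation: the hook-based capture, the use of a perturbed input batch as the error source, and the choice of the L2 norm are assumptions made for illustration. Per the rebuttal further down the page, layers whose rho sits farthest from one are removed first.

```python
import torch


@torch.no_grad()
def rank_layers_by_rho(model, layers, clean_batch, perturbed_batch):
    """Rank transformer blocks for removal by |rho - 1| using two forward passes.

    `layers` is assumed to be an ordered list of the model's blocks; the
    hidden state entering each block is captured with a forward pre-hook.
    The perturbed batch stands in for whatever reference error the paper
    injects (an assumption of this sketch). The last block's rho is omitted
    because its output is not captured here.
    """
    captured = {}

    def make_hook(idx, tag):
        def hook(module, inputs):
            captured[(idx, tag)] = inputs[0].detach()
        return hook

    for tag, batch in (("clean", clean_batch), ("pert", perturbed_batch)):
        handles = [layer.register_forward_pre_hook(make_hook(i, tag))
                   for i, layer in enumerate(layers)]
        model(batch)  # one forward pass per condition: two passes total
        for h in handles:
            h.remove()

    rhos = []
    for i in range(len(layers) - 1):
        err_in = (captured[(i, "pert")] - captured[(i, "clean")]).norm()
        err_out = (captured[(i + 1, "pert")] - captured[(i + 1, "clean")]).norm()
        rhos.append((err_out / err_in).item())

    # Rank by distance of rho from one; farthest-from-one layers first,
    # following the ranking described in the simulated rebuttal below.
    order = sorted(range(len(rhos)), key=lambda i: abs(rhos[i] - 1.0), reverse=True)
    return rhos, order
```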

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Width and internal redundancy appear to set an overall tolerance for rho variation, so wider models may tolerate more aggressive early compression than narrow ones.
  • Iterating the rho measurement after each pruning round could further reduce drift by adapting to the changed distribution.
  • The same product-of-ratios logic may apply to pruning in non-transformer sequence models whose layers also act as successive operators.

Load-bearing premise

The rho values computed on the original uncompressed model remain a reliable predictor of error growth once weights have been pruned or layers removed.

What would settle it

After applying rho-guided pruning or layer removal, recompute the actual representation drift on the compressed model and test whether it still equals the product of the original rho values.
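A minimal sketch of that check, under stated assumptions: the caller supplies the per-layer rho values measured on the original model, the error norms actually injected by the applied compression, and the drift measured on the compressed model. Treating multiple injected errors as approximately additive is an extra simplification of this sketch, as is the indexing convention for the product.

```python
import numpy as np


def post_compression_check(rho_original, injected_errors, observed_drift):
    """Compare rho-predicted drift against drift measured after compression.

    rho_original    : per-layer rho values from the uncompressed model
    injected_errors : error norm introduced at each layer by compression
                      (zero for layers left untouched)
    observed_drift  : final-representation drift measured on the
                      compressed model
    """
    num_layers = len(rho_original)
    predicted = 0.0
    for t, err in enumerate(injected_errors):
        if err > 0.0:
            # Scale each injected error by the product of later rho values.
            predicted += err * float(np.prod(rho_original[t:num_layers]))
    # A single ratio is shown here; across many compression configurations,
    # a rank correlation would mirror the Spearman test reported above.
    return predicted, observed_drift / max(predicted, 1e-12)
```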

Figures

Figures reproduced from arXiv: 2603.20991 by Abhinaba Basu, Koushik Deb, Kumkum Basu.

Figure 1. Overview. (a) Three composable approximation stages, each with machine…
Figure 2. Structural sensitivity analysis. (a) Per-group compression regret on log scale…
Figure 3. The early-layer MLP catastrophe. (a) Layer 0 component decomposition across…
Figure 4. Cumulative compression: forward (L0→L11) vs. backward (L11→L0). The 25,716× gap at one layer compressed demonstrates that layer 0 dominates the sensitivity landscape.
Figure 5. Contraction decomposition. (a) GPT-2 Small: blue bars show hidden state…
Figure 6. Per-layer contraction factor ρt across five architectures on a normalized depth axis. GPT-2 Small (blue) contracts everywhere (ρ < 1). LFM2 (green), Mistral (orange), and Qwen3 (red) all cross the stability boundary, with GPT-2 Medium (purple) spiking to ρ = 2.05 at the final layer.
Figure 7. Greedy sequential compression waterfall. Each bar adds one compression group…
read the original abstract

Compressing transformer weights makes large language models cheaper to deploy. But each layer's compression introduces an error. These errors accumulate as the signal passes through later layers, and how they accumulate is not well understood. We measure this directly: at each layer, we take the ratio of output to input error, calling it rho. A value below one means the layer absorbs the error; above one means it grows. Computing rho on six transformers (117M to 8B parameters) yields three findings. (i) Errors at layer t scale downstream by the product of later rho values, predicting representation drift (Spearman r = -0.44, p < 10^-4). This explains why compressing early layers hurts more than late ones, and why depth-decreasing sparsity schedules outperform uniform ones. Across architecture families, however, model width and redundancy matter more than rho alone. (ii) Within a layer, naive pruning shows a ~600x spread in component sensitivity. Activation-aware pruning (Wanda) shrinks this to 3-7x; the ranking reverses across architectures, so fixed importance scores do not transfer. (iii) For depth pruning, ranking layers by how far rho is from one takes two forward passes. It beats ShortGPT's Block Influence with 1.6x lower perplexity at eight layers removed, and physical deletion delivers 1.22x wall-clock speed-up. A blend of the two criteria does best (perplexity 14.2, 60.0% downstream accuracy on LLaMA-2-7B). Twelve Lean 4 norm inequalities provide machine-checked per-matrix error bounds. The contraction profile thus gives a training-free instrument for two decisions: where to compress within layers, and which to remove.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces rho as the per-layer relative error propagation factor, defined as the ratio of output to input error in transformer forward passes. It claims that downstream errors scale by the product of subsequent rho values, which correlates with representation drift (Spearman r = -0.44, p < 10^{-4}) across models from 117M to 8B parameters. This is used to explain why early-layer compression is more detrimental and why depth-decreasing sparsity outperforms uniform. The work also analyzes intra-layer pruning sensitivity (showing Wanda reduces spread to 3-7x), proposes rho-distance ranking for layer removal that beats ShortGPT (1.6x lower perplexity at 8 layers removed), and provides 12 machine-checked Lean 4 norm inequalities for error bounds.

Significance. Should the rho predictor prove robust to compression-induced distribution shifts, the results would provide a practical, training-free framework for deciding compression locations within and across layers. The machine-verified bounds and concrete empirical correlations (including 1.22x speed-up from physical deletion) add rigor and applicability to model compression research in large language models.

major comments (2)
  1. [Finding (i)] The central predictive claim in finding (i)—that errors at layer t scale downstream by the product of later rho values and predict representation drift—relies on rho measured from the uncompressed model remaining valid after pruning or layer removal. The reported Spearman correlation appears derived from original-model forward passes; a direct post-compression validation (predicted product vs. observed drift after weight changes) is needed, as non-linear layers and distribution shifts may alter effective propagation.
  2. [Depth pruning experiments] The depth-pruning results claim rho-based ranking (distance from 1) outperforms ShortGPT with 1.6x lower perplexity at eight layers removed. However, without explicit quantification of the ranking metric or confirmation that the original rho product correlates with actual post-removal drift, the mechanistic grounding for early-layer sensitivity and schedule superiority is not fully load-bearing.
minor comments (2)
  1. [Abstract] The abstract states 'Twelve Lean 4 norm inequalities provide machine-checked per-matrix error bounds' but gives no indication of which specific norms (e.g., operator vs. Frobenius) or matrices are bounded; a short reference to the relevant theorem or section would aid clarity.
  2. [Experimental setup] Reproducibility details such as exact error-norm definitions for rho, data splits used for drift measurements, and whether error bars reflect multiple runs are not specified in the provided text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key opportunities to strengthen the validation of rho under compression. We agree that direct post-compression checks would improve the manuscript and will add them. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: The central predictive claim in finding (i)—that errors at layer t scale downstream by the product of later rho values and predict representation drift—relies on rho measured from the uncompressed model remaining valid after pruning or layer removal. The reported Spearman correlation appears derived from original-model forward passes; a direct post-compression validation (predicted product vs. observed drift after weight changes) is needed, as non-linear layers and distribution shifts may alter effective propagation.

    Authors: We clarify that the reported Spearman correlation (r = -0.44) is computed using rho values from the uncompressed models, correlating the product of subsequent rho values with representation drift measured in the original forward passes. This establishes the error scaling relationship in the base setting. In the compression experiments, we apply the original rho values as a ranking heuristic for pruning and layer removal decisions, with the resulting performance gains (lower perplexity and higher downstream accuracy) serving as empirical support for its practical utility. We acknowledge that a direct post-compression validation—comparing the predicted product from original rho against observed drift after weight changes—would provide stronger mechanistic evidence under distribution shifts. We will add this analysis to the revised manuscript. revision: yes

  2. Referee: The depth-pruning results claim rho-based ranking (distance from 1) outperforms ShortGPT with 1.6x lower perplexity at eight layers removed. However, without explicit quantification of the ranking metric or confirmation that the original rho product correlates with actual post-removal drift, the mechanistic grounding for early-layer sensitivity and schedule superiority is not fully load-bearing.

    Authors: The ranking metric is quantified explicitly as the absolute distance |rho - 1| per layer (detailed in Section 4.3), with layers farthest from 1 removed first. This yields the reported 1.6x lower perplexity at eight layers removed on LLaMA-2-7B compared to ShortGPT. The mechanistic grounding rests on the error propagation findings, where deviation from rho = 1 indicates amplification or absorption. We agree that confirming the correlation between the original rho product for removed layers and post-removal observed drift would strengthen the explanation. We will include this comparison in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; rho measurements and Lean bounds are independent

full rationale

The paper defines rho directly as the measured ratio of output to input error from forward passes on the uncompressed model. The downstream scaling claim follows as a direct multiplicative consequence of these per-layer ratios under the observed propagation, and is then validated by an independent Spearman correlation (r = -0.44) against separately measured representation drift. This correlation is an empirical test, not a quantity forced by construction or by fitting rho to the drift target. The twelve Lean 4 norm inequalities supply machine-checked formal bounds on per-matrix errors that do not depend on the empirical rho values, the correlation, or any post-compression data. No self-citations appear as load-bearing premises, no uniqueness theorems are imported from prior author work, and no ansatz or known result is renamed as a new derivation. The central claims therefore rest on direct measurement plus independent formal verification rather than reducing to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on the definition of rho as a multiplicative factor and on standard matrix-norm inequalities for error bounds; no free parameters or new physical entities are introduced.

axioms (1)
  • standard math · Matrix norm inequalities bound per-layer error propagation
    Twelve Lean 4 norm inequalities provide machine-checked per-matrix error bounds.
invented entities (1)
  • rho (relative error propagation factor) · no independent evidence
    purpose: Quantify per-layer amplification of compression error
    Defined directly from forward-pass measurements; no independent falsifiable prediction outside the reported experiments.
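For concreteness, inequalities of the kind the single axiom above points to look like the following; whether the paper's twelve Lean 4 statements take exactly this form is not confirmed by the text on this page.

```latex
% Standard per-matrix perturbation bound via operator-norm
% submultiplicativity (one plausible member of the family of
% machine-checked norm inequalities referenced above):
\lVert (W + \Delta W)\,x - W x \rVert_2
  = \lVert \Delta W\, x \rVert_2
  \le \lVert \Delta W \rVert_2 \, \lVert x \rVert_2,
\qquad
\lVert A B \rVert_2 \le \lVert A \rVert_2 \, \lVert B \rVert_2 .
```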

pith-pipeline@v0.9.0 · 5636 in / 1257 out tokens · 41012 ms · 2026-05-15T06:25:34.559249+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Coquelicot: A user-friendly library of real analysis for Coq

    Boldo, S., Lelay, C., Melquiond, G., 2015. Coquelicot: A user-friendly library of real analysis for Coq. Mathematics in Computer Science 9, 41–62

  2. [2]

    Certified adversarial robustness via randomized smoothing, in: ICML

    Cohen, J., Rosenfeld, E., Kolter, Z., 2019. Certified adversarial robustness via randomized smoothing, in: ICML

  3. [3]

    LLM.int8(): 8-bit matrix multiplication for transformers at scale, in: NeurIPS

    Dettmers, T., Lewis, M., Belkada, Y., Zettlemoyer, L., 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale, in: NeurIPS

  4. [4]

    HAWQ: Hessian aware quantization of neural networks with mixed-precision, in: ICCV

    Dong, Z., Yao, Z., Gholami, A., Mahoney, M.W., Keutzer, K., 2019. HAWQ: Hessian aware quantization of neural networks with mixed-precision, in: ICCV

  5. [5]

    SparseGPT: Massive language models can be accurately pruned in one-shot, in: ICML

    Frantar, E., Alistarh, D., 2023. SparseGPT: Massive language models can be accurately pruned in one-shot, in: ICML

  6. [6]

    GPTQ: Accurate post-training quantization for generative pre-trained transformers, in: ICLR

    Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D., 2023. GPTQ: Accurate post-training quantization for generative pre-trained transformers, in: ICLR

  7. [7]

    A framework for few-shot language model evaluation

    Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., Zou, A., 2024. A framework for few-shot language model evaluation. URL:https://zen...

  8. [8]

    All optimal Hankel-norm approximations of linear multivariable systems and their L∞-error bounds

    Glover, K., 1984. All optimal Hankel-norm approximations of linear multivariable systems and their L∞-error bounds. International Journal of Control 39, 1115–1193

  9. [9]

    Stable architectures for deep neural networks

    Haber, E., Ruthotto, L., 2018. Stable architectures for deep neural networks. Inverse Problems 34, 014004

  10. [10]

    Language model compression with weighted low-rank factorization, in: ICLR

    Hsu, Y.C., Hua, T., Chang, S.E., Lou, Q., Shen, Y., Jin, H., 2022. Language model compression with weighted low-rank factorization, in: ICLR

  11. [11]

    Reluplex: An efficient SMT solver for verifying deep neural networks, in: CAV

    Katz, G., Barrett, C., Dill, D.L., Julian, K., Kochenderfer, M.J., 2017. Reluplex: An efficient SMT solver for verifying deep neural networks, in: CAV

  12. [12]

    Nonlinear Systems

    Khalil, H.K., 2002. Nonlinear Systems. 3rd ed., Prentice Hall

  13. [13]

    Learned token pruning for transformers, in: KDD

    Kim, S., Shen, S., Thorsley, D., Gholami, A., Kwon, W., Hassoun, J., Keutzer, K., 2022. Learned token pruning for transformers, in: KDD

  14. [14]

    Fast inference from transformers via speculative decoding, in: ICML

    Leviathan, Y., Kalman, M., Matias, Y., 2023. Fast inference from transformers via speculative decoding, in: ICML

  15. [15]

    Pointer sentinel mixture models, in: ICLR

    Merity, S., Xiong, C., Bradbury, J., Socher, R., 2017. Pointer sentinel mixture models, in: ICLR

  16. [16]

    Principal component analysis in linear systems: Controllability, observability, and model reduction

    Moore, B.C., 1981. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control 26, 17–32

  17. [17]

    The Lean 4 theorem prover and programming language, in: CADE

    de Moura, L., Ullrich, S., 2021. The Lean 4 theorem prover and programming language, in: CADE

  18. [18]

    Efficiently scaling transformer inference, in: MLSys

    Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., Dean, J., 2023. Efficiently scaling transformer inference, in: MLSys

  19. [19]

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.,

  20. [20]

    Language models are unsupervised multitask learners

    Language models are unsupervised multitask learners. OpenAI Blog

  21. [21]

    Exploring the limits of transfer learn- ing with a unified text-to-text transformer

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., 2020. Exploring the limits of transfer learn- ing with a unified text-to-text transformer. Journal of Machine Learning Research 21, 1–67

  22. [22]

    Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., Smith, N.A.,

  23. [23]

    The right tool for the job: Matching model and instance complexities, in: ACL

  24. [24]

    Developing bug-free machine learning systems with formal mathematics, in: ICML

    Selsam, D., Liang, P., Dill, D.L., 2017. Developing bug-free machine learning systems with formal mathematics, in: ICML

  25. [25]

    Fast Transformer Decoding: One Write-Head is All You Need

    Shazeer, N., 2019. Fast transformer decoding: One write-head is all you need. arXiv:1911.02150

  26. [26]

    A simple and effective pruning approach for large language models, in: ICLR

    Sun, M., Liu, Z., Bair, A., Kolter, J.Z., 2024. A simple and effective pruning approach for large language models, in: ICLR

  27. [27]

    Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., He, Y.,

  28. [28]

    ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers, in: NeurIPS

  29. [29]

    Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity, in: ICML

    Yin, L., Wu, Y., Zhang, Z., Hsieh, C.Y., Wang, Y., Jia, Y., Pechenizkiy, M., Liang, Y., Wang, Z., Liu, S., 2024. Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity, in: ICML