No Free Swap: Protocol-Dependent Layer Redundancy in Transformers
Pith reviewed 2026-05-20 20:24 UTC · model grok-4.3
The pith
Different tests for whether transformer layers can substitute for each other can point to different layers as safe to remove.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower.
What carries the argument
The two output-grounded swap-KL probes under replacement versus interchange protocols for measuring layer redundancy.
Load-bearing premise
That the two output-grounded swap-KL probes are measuring meaningfully distinct aspects of layer redundancy and that this distinction directly affects which layers can be removed without large performance loss.
What would settle it
Pruning layers ranked as safe by replacement versus interchange protocols on the same checkpoint and measuring the actual performance drop on a held-out task would show whether the protocol gap translates to real compression differences.
Figures
read the original abstract
When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that replacement and interchange swap-KL probes, though both output-grounded, measure distinct aspects of layer redundancy in pretrained transformers; the protocol gap alters which layers appear safe to prune by several-fold under matched budgets and evaluators, with the gap widening along Pythia training trajectories and producing safer interchange-guided pruning on Qwen3-8B (but not Llama-3.1-8B) at 8B scale on WikiText-2. The work advocates scoring both protocols before removal or merging and supports this with direct measurements on existing checkpoints.
Significance. If the protocol distinction is shown to be non-confounded, the result supplies a low-cost, unlabeled diagnostic that can improve pruning safety in transformer compression pipelines. The paper's strengths include its use of direct forward-pass measurements on public checkpoints, absence of fitted parameters or invented entities, and explicit demonstration that metric gaps need not map one-to-one to removal cost.
major comments (2)
- [Abstract / pruning experiments] Abstract and pruning experiments: the claim that interchange-guided removal is several-fold safer than replacement-guided at matched layer budgets for Qwen3-8B rests on the assumption that the two protocols identify meaningfully distinct redundancies. However, the manuscript does not report the overlap or Jaccard index between the two selected layer sets. If the protocols flag largely disjoint layers, the observed safety gap on WikiText-2 could arise from incidental layer identity rather than superior redundancy detection by interchange, undermining the central protocol-dependent claim.
- [8B-scale results] § on 8B-scale results: the statement that Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower is presented without controls for evaluator choice or exact layer budgets; this weakens the assertion that metric gaps need not map one-to-one to removal outcomes.
minor comments (2)
- [Abstract] The abstract refers to 'several-fold' safety improvement without supplying the precise ratios, error bars, or layer counts; adding these numbers would improve readability.
- [Methods] Notation for the two swap-KL variants could be introduced earlier with explicit formulas to avoid reader confusion between replacement and interchange distances.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our protocol distinction. We address each major comment below and have revised the manuscript accordingly to strengthen the evidence for the central claims.
read point-by-point responses
-
Referee: [Abstract / pruning experiments] Abstract and pruning experiments: the claim that interchange-guided removal is several-fold safer than replacement-guided at matched layer budgets for Qwen3-8B rests on the assumption that the two protocols identify meaningfully distinct redundancies. However, the manuscript does not report the overlap or Jaccard index between the two selected layer sets. If the protocols flag largely disjoint layers, the observed safety gap on WikiText-2 could arise from incidental layer identity rather than superior redundancy detection by interchange, undermining the central protocol-dependent claim.
Authors: We agree that reporting the overlap between the selected layer sets is necessary to confirm that the pruning safety gap arises from protocol-specific redundancy detection rather than incidental differences in layer identity. In the revised manuscript we compute and report the Jaccard index between the layers flagged by each protocol on Qwen3-8B under the matched budgets used in the WikiText-2 experiments. The index is low (approximately 0.28 for the top-4 layers), indicating largely disjoint sets. This additional analysis has been inserted into the pruning experiments section and referenced in the abstract to directly address the concern. revision: yes
-
Referee: [8B-scale results] § on 8B-scale results: the statement that Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower is presented without controls for evaluator choice or exact layer budgets; this weakens the assertion that metric gaps need not map one-to-one to removal outcomes.
Authors: We acknowledge that the original presentation of the Llama-3.1-8B results omitted explicit confirmation of matched controls. The revised section now states that both protocols were evaluated under identical conditions: the same WikiText-2 perplexity evaluator and precisely the same layer budgets (removal of 2 and 4 layers). Under these matched settings the post-pruning perplexity increases are statistically indistinguishable, even though interchange KL remains lower. This clarification supports the claim that metric gaps need not translate directly into removal-cost differences and has been added to the 8B-scale results subsection. revision: yes
Circularity Check
No circularity: purely empirical measurements on fixed checkpoints
full rationale
The manuscript reports direct forward-pass evaluations of two output-grounded swap-KL protocols (replacement vs. interchange) on existing pretrained checkpoints (Pythia trajectories, Qwen3-8B, Llama-3.1-8B). No derivation chain, fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations appear. Central claims rest on observed differences in layer sets and downstream WikiText-2 perplexity at matched budgets; these are falsifiable against external benchmarks and do not reduce to the paper's own inputs by construction. The work is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption KL divergence on model outputs is a valid proxy for whether one layer can substitute for another without large performance change
Reference graph
Works this paper leans on
-
[1]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Angela Fan, Edouard Grave, and Armand Joulin
URL https://transformer-circuits.pub/2021/framework/index.html. Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. InInternational Conference on Learning Representations,
work page 2021
-
[3]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942,
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[5]
Shortgpt: Layers in large language models are more redundant than you expect
Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect.arXiv preprint arXiv:2403.03853,
-
[6]
Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
arXiv preprint arXiv:2402.09025 , year=
Jiwon Song, Kyungseok Kim, Junwan Cho, Sungmin Kim, Jongjin Shin, and Jinhyuk Moon. SLEB: Stream- lining LLMs through redundancy verification and elimination of transformer blocks.arXiv preprint arXiv:2402.09025,
-
[8]
Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias,
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Causal mediation analysis for interpreting neural NLP: The case of gender bias.arXiv preprint arXiv:2004.12265,
-
[9]
Laco: Large language model pruning via layer collapse, 2024
Yifei Yang, Zouying Huang, and Feng Liu. LaCo: Large language model pruning via layer collapse.arXiv preprint arXiv:2402.11187,
-
[10]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Positional-encodingfamiliestag descriptive correlations only; pair distance and training dominate in controlled runs. Appendix P; RoPE-off counterfactual in §4.4. Table 12: Where protocol-relative equivalence matters beyond pruning. Each domain makes claims about layer similarity using an implicit protocol; our evidence shows these claims can be protocol-...
work page 2024
-
[12]
that tolerates dense removal better than GPT-2-Medium’s narrower zone (layers 4–15 out of 24), but we treat that explanation as hypothesis-level rather than isolated causality. D Head-Level Distance Analysis To understand whether the protocol-gap pattern is driven by specific attention heads or is a property of the entire layer representation, we conduct ...
work page 2024
-
[13]
The min-to-max ratio within each pair spans only about one order of magnitude (e.g., 0.0006 to 0.0047 for pair 4↔5), indicating that no single head drives the layer-level similarity. This uniformity has an important practical implication: head pruning within swap-similar layers offers no advantage over full layer removal. If swap-KL similarity were concen...
work page 2022
-
[14]
Evaluation is on WikiText-2 test (50 sequences). To disentangle domain adaptation from pruning recovery, we train an identical LoRA configuration on the unprunedfull model as a control. The full model’s PPL drops from 23.5 to 17.3 with LoRA alone (−26.3%), confirming substantial domain adaptation to WikiText-2. We report net pruning recovery as the fracti...
work page 2025
-
[15]
Third, differences in training data (the Pile vs
provides more capacity per layer, potentially reducing the need for functional redundancy across layers. Third, differences in training data (the Pile vs. WebText) and optimization hyperparameters may lead to different loss landscape geometries that either encourage or discourage layer-level redundancy. I Negative Result: Weight Averaging A natural compre...
work page 2019
-
[16]
while interchange-beam stops at 6 layers (PPL 15.97). Llama-3.1-8B.At B≤100 calibration-free SLEB wins by removing fewer layers; the budget ceiling acts as an implicit stopping rule. At B=200 the pattern inverts: interchange-beam removes 4 layers (+40.2%) while SLEB removes 6 (+73.0%). By B=400 calibration-free SLEB removes 16 of 32 layers (PPL 185,+2131%...
work page 2019
-
[17]
(a) Layers 0–11. Layer Mean∥J k∥Max∥J k∥Min∥J k∥Region 0 17.71 20.18 14.91 Boundary (embedding) 1 1.82 2.07 1.71 Early 2 1.58 1.92 1.33 Early 3 1.46 2.20 0.84 Early 4 1.47 1.73 1.20 Swap-similar region 5 1.45 1.70 1.23 Swap-similar region 6 1.45 2.00 1.13 Swap-similar region 7 1.33 1.58 1.04 Swap-similar region 8 1.28 1.43 1.18 Swap-similar region 9 1.31 ...
work page 2021
-
[18]
Only layers 6–12 approach the contractivity boundary. No layer satisfies strict contractivity (max∥J k∥<1). The contrast with GPT-2-Medium is stark. In GPT-2 (AbsPE), spectral norms decrease monotonically from layer 1 to layer 20 (mean 1.82→0.75), making deep layers approximately contractive. In Pythia (RoPE), norms follow a U-shaped curve: a narrow near-...
work page 2048
-
[19]
RoPE produces smaller absolute distances at every gap (replacement ∼1.0 vs
(2) PE type modulates absolute distance magnitudes but not the qualitative gap =1/gap=3 structure. RoPE produces smaller absolute distances at every gap (replacement ∼1.0 vs. AbsPE 2.6 at gap =1), but both PE types show the same monotonic increase in median I/R with layer distance. PE is therefore at most a secondary modulator, not the primary axis of pro...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.