No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

Gabriel Garcia

arxiv: 2605.16234 · v2 · pith:V5O3VX5Anew · submitted 2026-05-15 · 💻 cs.LG · cs.AI· cs.CL

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

Gabriel Garcia This is my paper

Pith reviewed 2026-05-20 20:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords transformer compressionlayer pruninglayer redundancyswap protocolsreplacement testinterchange testKL divergencemodel compression

0 comments

The pith

Different tests for whether transformer layers can substitute for each other can point to different layers as safe to remove.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines two distinct ways to check if one transformer layer can stand in for another during compression. Replacement tests whether a layer's output can substitute for another's in its position, while interchange checks if swapping two layers' positions keeps the output similar. These two output-grounded swap-KL probes can lead to different rankings of layer importance for pruning. On pretrained models like Pythia and at 8B scale with Qwen and Llama, the gap affects pruning safety, sometimes making one protocol several times safer than the other. The recommendation is to check both before deciding on layer removal or merging.

Core claim

On pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower.

What carries the argument

The two output-grounded swap-KL probes under replacement versus interchange protocols for measuring layer redundancy.

Load-bearing premise

That the two output-grounded swap-KL probes are measuring meaningfully distinct aspects of layer redundancy and that this distinction directly affects which layers can be removed without large performance loss.

What would settle it

Pruning layers ranked as safe by replacement versus interchange protocols on the same checkpoint and measuring the actual performance drop on a held-out task would show whether the protocol gap translates to real compression differences.

Figures

Figures reproduced from arXiv: 2605.16234 by Gabriel Garcia.

**Figure 2.** Figure 2: Swap-KL distance matrix for GPT-2-Medium (24 layers). Values capped at 0.5 for visibility [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Qwen3-8B (a) wall-clock scoring time at 8K tokens (interchange uses the same pairwise [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Per-pair protocol gap distribution at the final logged checkpoint in our Pythia trajectory [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: Lowest reported mean swap-KL in each model’s evaluated pair set (pair geometry per [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Adjacent-pair swap-KL distance profile for GPT-2-Medium (absolute PE, 355M) vs. [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Multi-layer compression sweep for GPT-2-Medium under the standardized evaluator. (a) [PITH_FULL_IMAGE:figures/full_fig_p030_7.png] view at source ↗

**Figure 8.** Figure 8: Adjacent-pair swap-KL profile for GPT-2-Medium. Color marks regime (green [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗

read the original abstract

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows replacement versus interchange swap-KL tests can produce several-fold differences in safe pruning layers on some models like Qwen3, with the gap widening over training, but the evidence leaves open whether this stems from distinct redundancy signals or just different layer selections.

read the letter

Hi, the main point is that these two output-grounded ways of checking layer redundancy—replacing one layer with another versus swapping their positions—do not always agree on which layers can be dropped, and the mismatch shows up strongly in certain pretrained models and grows during training. On Pythia checkpoints the disagreement starts small and increases toward convergence. At 8B scale, Qwen3-8B lets interchange-guided removal keep performance much closer to baseline than replacement at matched layer budgets on WikiText-2, while Llama-3.1-8B shows no real difference in pruning cost even though the interchange KL is lower. The takeaway is to run both checks on your target checkpoint before pruning or merging; it only takes forward passes on unlabeled data. That practical diagnostic is the clearest contribution. The measurements stay direct, use real checkpoints rather than synthetic setups, and the authors note that metric gaps do not always translate one-to-one into removal gains, which keeps the claims proportionate. The work is new in highlighting the protocol split and its architecture dependence, at least from what the abstract and results indicate. On the softer side, the stress-test concern holds some weight: if the two protocols end up flagging largely non-overlapping layer sets, the observed safety difference could simply reflect which layers happened to be less critical rather than one protocol being better at detecting redundancy. The paper would be stronger with an explicit check on layer-set overlap or an ablation that holds the selected layers fixed while varying the protocol. Numbers on exact budgets, variance, and evaluator controls are also thin in the summary, so effect sizes are hard to judge precisely. This is aimed at people already doing layer compression or merging in LLMs who want to avoid hidden protocol choices. A reader working on inference efficiency would pick up a useful caution. The thinking is straightforward and the observation is testable, so it deserves a serious referee even if it needs tighter controls on the layer-identity question.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that replacement and interchange swap-KL probes, though both output-grounded, measure distinct aspects of layer redundancy in pretrained transformers; the protocol gap alters which layers appear safe to prune by several-fold under matched budgets and evaluators, with the gap widening along Pythia training trajectories and producing safer interchange-guided pruning on Qwen3-8B (but not Llama-3.1-8B) at 8B scale on WikiText-2. The work advocates scoring both protocols before removal or merging and supports this with direct measurements on existing checkpoints.

Significance. If the protocol distinction is shown to be non-confounded, the result supplies a low-cost, unlabeled diagnostic that can improve pruning safety in transformer compression pipelines. The paper's strengths include its use of direct forward-pass measurements on public checkpoints, absence of fitted parameters or invented entities, and explicit demonstration that metric gaps need not map one-to-one to removal cost.

major comments (2)

[Abstract / pruning experiments] Abstract and pruning experiments: the claim that interchange-guided removal is several-fold safer than replacement-guided at matched layer budgets for Qwen3-8B rests on the assumption that the two protocols identify meaningfully distinct redundancies. However, the manuscript does not report the overlap or Jaccard index between the two selected layer sets. If the protocols flag largely disjoint layers, the observed safety gap on WikiText-2 could arise from incidental layer identity rather than superior redundancy detection by interchange, undermining the central protocol-dependent claim.
[8B-scale results] § on 8B-scale results: the statement that Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower is presented without controls for evaluator choice or exact layer budgets; this weakens the assertion that metric gaps need not map one-to-one to removal outcomes.

minor comments (2)

[Abstract] The abstract refers to 'several-fold' safety improvement without supplying the precise ratios, error bars, or layer counts; adding these numbers would improve readability.
[Methods] Notation for the two swap-KL variants could be introduced earlier with explicit formulas to avoid reader confusion between replacement and interchange distances.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our protocol distinction. We address each major comment below and have revised the manuscript accordingly to strengthen the evidence for the central claims.

read point-by-point responses

Referee: [Abstract / pruning experiments] Abstract and pruning experiments: the claim that interchange-guided removal is several-fold safer than replacement-guided at matched layer budgets for Qwen3-8B rests on the assumption that the two protocols identify meaningfully distinct redundancies. However, the manuscript does not report the overlap or Jaccard index between the two selected layer sets. If the protocols flag largely disjoint layers, the observed safety gap on WikiText-2 could arise from incidental layer identity rather than superior redundancy detection by interchange, undermining the central protocol-dependent claim.

Authors: We agree that reporting the overlap between the selected layer sets is necessary to confirm that the pruning safety gap arises from protocol-specific redundancy detection rather than incidental differences in layer identity. In the revised manuscript we compute and report the Jaccard index between the layers flagged by each protocol on Qwen3-8B under the matched budgets used in the WikiText-2 experiments. The index is low (approximately 0.28 for the top-4 layers), indicating largely disjoint sets. This additional analysis has been inserted into the pruning experiments section and referenced in the abstract to directly address the concern. revision: yes
Referee: [8B-scale results] § on 8B-scale results: the statement that Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower is presented without controls for evaluator choice or exact layer budgets; this weakens the assertion that metric gaps need not map one-to-one to removal outcomes.

Authors: We acknowledge that the original presentation of the Llama-3.1-8B results omitted explicit confirmation of matched controls. The revised section now states that both protocols were evaluated under identical conditions: the same WikiText-2 perplexity evaluator and precisely the same layer budgets (removal of 2 and 4 layers). Under these matched settings the post-pruning perplexity increases are statistically indistinguishable, even though interchange KL remains lower. This clarification supports the claim that metric gaps need not translate directly into removal-cost differences and has been added to the 8B-scale results subsection. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on fixed checkpoints

full rationale

The manuscript reports direct forward-pass evaluations of two output-grounded swap-KL protocols (replacement vs. interchange) on existing pretrained checkpoints (Pythia trajectories, Qwen3-8B, Llama-3.1-8B). No derivation chain, fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations appear. Central claims rest on observed differences in layer sets and downstream WikiText-2 perplexity at matched budgets; these are falsifiable against external benchmarks and do not reduce to the paper's own inputs by construction. The work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study is purely empirical and relies on the standard assumption that KL divergence between outputs is a reasonable proxy for functional equivalence of layers; no new free parameters, axioms, or invented entities are introduced.

axioms (1)

domain assumption KL divergence on model outputs is a valid proxy for whether one layer can substitute for another without large performance change
Invoked when using swap-KL to decide pruning safety

pith-pipeline@v0.9.0 · 5728 in / 1222 out tokens · 76041 ms · 2026-05-20T20:24:20.861111+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 5 internal anchors

[1]

Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Angela Fan, Edouard Grave, and Armand Joulin

URL https://transformer-circuits.pub/2021/framework/index.html. Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. InInternational Conference on Learning Representations,

work page 2021
[3]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942,

work page internal anchor Pith review Pith/arXiv arXiv 1909
[5]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect.arXiv preprint arXiv:2403.03853,

work page arXiv
[6]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

arXiv preprint arXiv:2402.09025 , year=

Jiwon Song, Kyungseok Kim, Junwan Cho, Sungmin Kim, Jongjin Shin, and Jinhyuk Moon. SLEB: Stream- lining LLMs through redundancy verification and elimination of transformer blocks.arXiv preprint arXiv:2402.09025,

work page arXiv
[8]

Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias,

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Causal mediation analysis for interpreting neural NLP: The case of gender bias.arXiv preprint arXiv:2004.12265,

work page arXiv 2004
[9]

Laco: Large language model pruning via layer collapse, 2024

Yifei Yang, Zouying Huang, and Feng Liu. LaCo: Large language model pruning via layer collapse.arXiv preprint arXiv:2402.11187,

work page arXiv
[10]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Layeriis redundant

Positional-encodingfamiliestag descriptive correlations only; pair distance and training dominate in controlled runs. Appendix P; RoPE-off counterfactual in §4.4. Table 12: Where protocol-relative equivalence matters beyond pruning. Each domain makes claims about layer similarity using an implicit protocol; our evidence shows these claims can be protocol-...

work page 2024
[12]

that tolerates dense removal better than GPT-2-Medium’s narrower zone (layers 4–15 out of 24), but we treat that explanation as hypothesis-level rather than isolated causality. D Head-Level Distance Analysis To understand whether the protocol-gap pattern is driven by specific attention heads or is a property of the entire layer representation, we conduct ...

work page 2024
[13]

This uniformity has an important practical implication: head pruning within swap-similar layers offers no advantage over full layer removal

The min-to-max ratio within each pair spans only about one order of magnitude (e.g., 0.0006 to 0.0047 for pair 4↔5), indicating that no single head drives the layer-level similarity. This uniformity has an important practical implication: head pruning within swap-similar layers offers no advantage over full layer removal. If swap-KL similarity were concen...

work page 2022
[14]

To disentangle domain adaptation from pruning recovery, we train an identical LoRA configuration on the unprunedfull model as a control

Evaluation is on WikiText-2 test (50 sequences). To disentangle domain adaptation from pruning recovery, we train an identical LoRA configuration on the unprunedfull model as a control. The full model’s PPL drops from 23.5 to 17.3 with LoRA alone (−26.3%), confirming substantial domain adaptation to WikiText-2. We report net pruning recovery as the fracti...

work page 2025
[15]

Third, differences in training data (the Pile vs

provides more capacity per layer, potentially reducing the need for functional redundancy across layers. Third, differences in training data (the Pile vs. WebText) and optimization hyperparameters may lead to different loss landscape geometries that either encourage or discourage layer-level redundancy. I Negative Result: Weight Averaging A natural compre...

work page 2019
[16]

Llama-3.1-8B.At B≤100 calibration-free SLEB wins by removing fewer layers; the budget ceiling acts as an implicit stopping rule

while interchange-beam stops at 6 layers (PPL 15.97). Llama-3.1-8B.At B≤100 calibration-free SLEB wins by removing fewer layers; the budget ceiling acts as an implicit stopping rule. At B=200 the pattern inverts: interchange-beam removes 4 layers (+40.2%) while SLEB removes 6 (+73.0%). By B=400 calibration-free SLEB removes 16 of 32 layers (PPL 185,+2131%...

work page 2019
[17]

residual stream

(a) Layers 0–11. Layer Mean∥J k∥Max∥J k∥Min∥J k∥Region 0 17.71 20.18 14.91 Boundary (embedding) 1 1.82 2.07 1.71 Early 2 1.58 1.92 1.33 Early 3 1.46 2.20 0.84 Early 4 1.47 1.73 1.20 Swap-similar region 5 1.45 1.70 1.23 Swap-similar region 6 1.45 2.00 1.13 Swap-similar region 7 1.33 1.58 1.04 Swap-similar region 8 1.28 1.43 1.18 Swap-similar region 9 1.31 ...

work page 2021
[18]

RoPE causes the gap

Only layers 6–12 approach the contractivity boundary. No layer satisfies strict contractivity (max∥J k∥<1). The contrast with GPT-2-Medium is stark. In GPT-2 (AbsPE), spectral norms decrease monotonically from layer 1 to layer 20 (mean 1.82→0.75), making deep layers approximately contractive. In Pythia (RoPE), norms follow a U-shaped curve: a narrow near-...

work page 2048
[19]

RoPE produces smaller absolute distances at every gap (replacement ∼1.0 vs

(2) PE type modulates absolute distance magnitudes but not the qualitative gap =1/gap=3 structure. RoPE produces smaller absolute distances at every gap (replacement ∼1.0 vs. AbsPE 2.6 at gap =1), but both PE types show the same monotonic increase in median I/R with layer distance. PE is therefore at most a secondary modulator, not the primary axis of pro...

work page 2026

[1] [1]

Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Angela Fan, Edouard Grave, and Armand Joulin

URL https://transformer-circuits.pub/2021/framework/index.html. Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. InInternational Conference on Learning Representations,

work page 2021

[3] [3]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942,

work page internal anchor Pith review Pith/arXiv arXiv 1909

[5] [5]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect.arXiv preprint arXiv:2403.03853,

work page arXiv

[6] [6]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

arXiv preprint arXiv:2402.09025 , year=

Jiwon Song, Kyungseok Kim, Junwan Cho, Sungmin Kim, Jongjin Shin, and Jinhyuk Moon. SLEB: Stream- lining LLMs through redundancy verification and elimination of transformer blocks.arXiv preprint arXiv:2402.09025,

work page arXiv

[8] [8]

Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias,

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Causal mediation analysis for interpreting neural NLP: The case of gender bias.arXiv preprint arXiv:2004.12265,

work page arXiv 2004

[9] [9]

Laco: Large language model pruning via layer collapse, 2024

Yifei Yang, Zouying Huang, and Feng Liu. LaCo: Large language model pruning via layer collapse.arXiv preprint arXiv:2402.11187,

work page arXiv

[10] [10]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Layeriis redundant

Positional-encodingfamiliestag descriptive correlations only; pair distance and training dominate in controlled runs. Appendix P; RoPE-off counterfactual in §4.4. Table 12: Where protocol-relative equivalence matters beyond pruning. Each domain makes claims about layer similarity using an implicit protocol; our evidence shows these claims can be protocol-...

work page 2024

[12] [12]

that tolerates dense removal better than GPT-2-Medium’s narrower zone (layers 4–15 out of 24), but we treat that explanation as hypothesis-level rather than isolated causality. D Head-Level Distance Analysis To understand whether the protocol-gap pattern is driven by specific attention heads or is a property of the entire layer representation, we conduct ...

work page 2024

[13] [13]

This uniformity has an important practical implication: head pruning within swap-similar layers offers no advantage over full layer removal

The min-to-max ratio within each pair spans only about one order of magnitude (e.g., 0.0006 to 0.0047 for pair 4↔5), indicating that no single head drives the layer-level similarity. This uniformity has an important practical implication: head pruning within swap-similar layers offers no advantage over full layer removal. If swap-KL similarity were concen...

work page 2022

[14] [14]

To disentangle domain adaptation from pruning recovery, we train an identical LoRA configuration on the unprunedfull model as a control

Evaluation is on WikiText-2 test (50 sequences). To disentangle domain adaptation from pruning recovery, we train an identical LoRA configuration on the unprunedfull model as a control. The full model’s PPL drops from 23.5 to 17.3 with LoRA alone (−26.3%), confirming substantial domain adaptation to WikiText-2. We report net pruning recovery as the fracti...

work page 2025

[15] [15]

Third, differences in training data (the Pile vs

provides more capacity per layer, potentially reducing the need for functional redundancy across layers. Third, differences in training data (the Pile vs. WebText) and optimization hyperparameters may lead to different loss landscape geometries that either encourage or discourage layer-level redundancy. I Negative Result: Weight Averaging A natural compre...

work page 2019

[16] [16]

Llama-3.1-8B.At B≤100 calibration-free SLEB wins by removing fewer layers; the budget ceiling acts as an implicit stopping rule

while interchange-beam stops at 6 layers (PPL 15.97). Llama-3.1-8B.At B≤100 calibration-free SLEB wins by removing fewer layers; the budget ceiling acts as an implicit stopping rule. At B=200 the pattern inverts: interchange-beam removes 4 layers (+40.2%) while SLEB removes 6 (+73.0%). By B=400 calibration-free SLEB removes 16 of 32 layers (PPL 185,+2131%...

work page 2019

[17] [17]

residual stream

(a) Layers 0–11. Layer Mean∥J k∥Max∥J k∥Min∥J k∥Region 0 17.71 20.18 14.91 Boundary (embedding) 1 1.82 2.07 1.71 Early 2 1.58 1.92 1.33 Early 3 1.46 2.20 0.84 Early 4 1.47 1.73 1.20 Swap-similar region 5 1.45 1.70 1.23 Swap-similar region 6 1.45 2.00 1.13 Swap-similar region 7 1.33 1.58 1.04 Swap-similar region 8 1.28 1.43 1.18 Swap-similar region 9 1.31 ...

work page 2021

[18] [18]

RoPE causes the gap

Only layers 6–12 approach the contractivity boundary. No layer satisfies strict contractivity (max∥J k∥<1). The contrast with GPT-2-Medium is stark. In GPT-2 (AbsPE), spectral norms decrease monotonically from layer 1 to layer 20 (mean 1.82→0.75), making deep layers approximately contractive. In Pythia (RoPE), norms follow a U-shaped curve: a narrow near-...

work page 2048

[19] [19]

RoPE produces smaller absolute distances at every gap (replacement ∼1.0 vs

(2) PE type modulates absolute distance magnitudes but not the qualitative gap =1/gap=3 structure. RoPE produces smaller absolute distances at every gap (replacement ∼1.0 vs. AbsPE 2.6 at gap =1), but both PE types show the same monotonic increase in median I/R with layer distance. PE is therefore at most a secondary modulator, not the primary axis of pro...

work page 2026