
arxiv: 2607.01232 · v2 · pith:LJ364HT3new · submitted 2026-07-01 · 💻 cs.LG · cs.CL

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

Pith reviewed 2026-07-02 14:45 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords transformer layers · reinforcement learning · LLM post-training · layer contribution · RL adaptation · middle layers · parameter efficiency

The pith

Training a single middle transformer layer recovers most gains from full-parameter RL training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how reinforcement learning updates distribute across transformer layers in LLM post-training. It demonstrates that updating parameters in only one layer, especially a middle one, often captures nearly all the performance lift obtained by updating the entire model. The authors introduce layer contribution as the fraction of total RL improvement recovered when a single layer is trained while others stay frozen. This concentration of gains in middle layers appears consistently across models, algorithms, and tasks. If the pattern holds, post-training could shift from uniform updates to selective layer focus.

Core claim

Across seven models, three RL algorithms, and tasks in math, code, and agentic settings, RL gains concentrate in a small subset of layers, frequently a single middle layer, such that training that layer alone recovers most or all of the improvement from full-parameter training.

What carries the argument

The layer contribution metric: the fraction of full RL improvement recovered by training one layer in isolation with all others frozen.
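The verbal definition can be written as a simple ratio. The following is a minimal sketch, assuming the metric takes the form (single-layer score − base score) / (full-training score − base score), which matches the wording here and the form the referee suggests later on this page. The function name and all scores are hypothetical and are not taken from the paper.

```python
def layer_contribution(base_score, layer_score, full_score):
    """Fraction of the full-parameter RL gain recovered by training one layer alone."""
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full-parameter training produced no gain; ratio is undefined")
    return (layer_score - base_score) / full_gain


# Hypothetical accuracies, for illustration only (not results from the paper).
base, full = 0.25, 0.75
per_layer = {0: 0.375, 14: 0.75, 27: 0.5}  # layer index -> score after single-layer RL

contributions = {k: layer_contribution(base, s, full) for k, s in per_layer.items()}
print(contributions)
```

Under this form, a value of 1.0 means the isolated layer matched full-parameter training, and a value above 1.0 means it exceeded it.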

If this is right

  • RL post-training can target only middle layers instead of all parameters for similar results.
  • The high-contribution pattern remains stable across datasets, model families, and RL algorithms.
  • Layers near the input and output ends contribute substantially less than middle layers.
  • Layer rankings derived from the metric stay consistent across different tasks.
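One way to check the ranking-consistency point is a rank correlation between per-layer contributions measured on two tasks. The sketch below uses Spearman's coefficient as an assumed choice, since this summary does not say which statistic the paper reports for rankings. It assumes no tied values, and the numbers are made up for illustration.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical per-layer contributions on two tasks (index = layer position).
math_c = [0.1, 0.6, 1.0, 0.7, 0.2]
code_c = [0.2, 0.5, 0.9, 0.8, 0.1]
rho = spearman(math_c, code_c)
print(rho)
```

A coefficient near 1 would indicate that the same layers rank highly on both tasks. With real data, ties would need an average-rank treatment that this sketch omits.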

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods to locate high-contribution layers early could reduce total compute needed for RL post-training.
  • The concentration finding raises the question of whether middle layers specialize in the adaptations RL induces.
  • Uniform parameter updates may waste effort on low-impact layers in current practice.

Load-bearing premise

Gains measured when one layer is trained in isolation reflect that layer's contribution during simultaneous full-parameter training, with no important cross-layer interactions altering it.

What would settle it

An experiment in which jointly training the two highest-contribution layers produces substantially more improvement than the sum of their individual layer contributions.
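The comparison described here reduces to simple arithmetic on four scores. This is a minimal sketch with a hypothetical function name and made-up numbers. What counts as "substantially more" is not specified on this page, so no threshold is built in.

```python
def interaction_excess(base_score, score_a, score_b, score_joint):
    """Joint gain minus the sum of the two isolated gains (positive = super-additive)."""
    gain_a = score_a - base_score
    gain_b = score_b - base_score
    gain_joint = score_joint - base_score
    return gain_joint - (gain_a + gain_b)


# Hypothetical scores, for illustration only.
excess = interaction_excess(
    base_score=0.25,   # untrained base model
    score_a=0.5,       # layer A trained alone
    score_b=0.375,     # layer B trained alone
    score_joint=0.75,  # layers A and B trained together
)
print(excess)
```

A positive excess that is large relative to seed-to-seed noise would suggest cross-layer interaction. A value near zero would be consistent with additive isolated gains.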

Figures

Figures reproduced from arXiv: 2607.01232 by Athanasios Glentis, Chung-Yiu Yau, Dawei Li, Hongzhou Lin, Mingyi Hong, Rizhen Hu, Zijian Zhang.

Figure 1: (a) Layer contribution (defined in §2.2) across all seven models studied in this work, plotted against depth
Figure 2: Layer contribution C(k) across model scales. Blue: math contribution (in-domain). Black: overall contribution (averaged across all capabilities). Dashed line indicates full-parameter training (C = 1.0). Each point represents one layer trained in isolation. Math and overall contribution closely track each other across layers (Pearson r > 0.6 on 1.7B, 4B, and 8B), indicating that high-contribution layers achie…
Figure 3: Cross-dataset consistency of layer contribution on Qwen3-1.7B-Base. Each point represents a single layer. (a)
Figure 4: Layer contribution C(k) for Qwen2.5-Math-1.5B (28 layers) trained with Dr. GRPO. Each point corresponds to one transformer layer trained in isolation. The dashed line marks full-parameter training (C = 1.0); circled markers indicate layers that reach or exceed it. Despite the change in both model family and RL algorithm, the contribution profile retains the same structure observed on Qwen3: middle layers c…
Figure 5: Layer contribution C(k) on the agentic task ALFWorld, trained with GiGPO. (a) Qwen2.5-1.5B-Instruct (28 layers). (b) Qwen2.5-3B-Instruct (36 layers). A representative subset of layers is trained due to computational constraints. The dashed line marks full-parameter training (C = 1.0); circled markers indicate layers that reach or exceed it. Despite the shift from mathematical reasoning to multi-step agenti…
Figure 6: Layer contribution C(k) for DeepSeek-Distilled-Qwen-7B (28 layers) trained with GRPO on the Skywork mathematics dataset. Only a subset of layers (0, 4, 8, 12, 14, 16, 20, 24) is trained due to computational constraints. The dashed line marks full-parameter training (C = 1.0); the circled marker indicates the layer that exceeds it. Despite differing from the Qwen3 and Qwen2.5 models in both pretraining rec…
Figure 7: Layer contribution-guided training strategies across model scales.
Figure 8: Majority voting results on OlympiadBench (Qwen3-1.7B-Base). Voting across
Figure 9: Per-layer weight change magnitude ∥Δθ_k∥_2 on Qwen3-1.7B-Base. Blue: full-parameter training (all layers change). Colored spikes: single-layer training (only the trained layer changes; all others remain at zero). Under full training, the weight change is relatively uniform across layers, contrasting with the highly non-uniform layer contribution profile. Under single-layer training, all trained layers underg…
Original abstract

Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that RL post-training gains in LLMs are highly concentrated in a small subset of transformer layers, often a single middle layer. Training one layer in isolation recovers most or all of the gains from full-parameter RL training across seven models (Qwen3, Qwen2.5), three algorithms (GRPO, GiGPO, Dr. GRPO), and domains including math, code, and agentic tasks. The authors introduce a 'layer contribution' metric (the fraction of full RL improvement recovered by isolated layer training) and report that high-contribution layers consistently appear in the middle of the stack, with stable rankings across settings.

Significance. If the empirical pattern holds under the stated methodology, the result would be significant for understanding how RL adaptation is distributed in transformers and for designing more parameter-efficient RL fine-tuning methods. The reported consistency across models, algorithms, and tasks provides a broad empirical base; the introduction of a quantifiable 'layer contribution' metric is a useful framing device for future work on selective updates.

major comments (2)
  1. [Section 3 (Layer Contribution definition and experimental protocol)] The layer contribution metric (defined via isolated training of one layer with all others frozen) is load-bearing for the headline claim that 'one layer is enough.' The manuscript provides no experiments testing whether isolated gains are additive or whether non-additive cross-layer interactions (gradient interference, representation realignment, or compensatory plasticity) arise only under simultaneous full-parameter updates. Without such controls or an ablation showing that the sum of isolated contributions approximates full training, the metric may not accurately reflect each layer's role during standard RL training.
  2. [Section 4 (Results) and Appendix (training details)] Abstract and results sections report consistent patterns but give no details on statistical significance testing, variance across random seeds, or exact controls for training protocol differences (e.g., learning rate scaling, batch size, or optimizer state when only one layer is updated). These omissions make it difficult to assess whether the reported 'most of the gains' or 'surpass' cases are robust.
minor comments (2)
  1. [Section 3] Notation for 'layer contribution' should be formalized with an equation (e.g., C_l = (Perf_l - Perf_0) / (Perf_full - Perf_0)) to avoid ambiguity in how 'recovered fraction' is computed when isolated training exceeds full training.
  2. [Figures 2-5 and Tables 1-3] Figure captions and tables should explicitly state the number of runs per condition and whether error bars represent standard deviation or standard error.
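The distinction the second minor comment asks the captions to state can be made concrete. The sketch below uses only the Python standard library to compute both candidate error-bar quantities from per-seed scores. The function name and the scores are hypothetical.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) for per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)     # sample SD (n - 1 denominator)
    se = sd / math.sqrt(len(scores))  # standard error of the mean
    return mean, sd, se


# Hypothetical scores from three seeds of one condition.
mean, sd, se = summarize_runs([0.25, 0.5, 0.75])
print(mean, sd, se)
```

The standard error shrinks with the number of runs while the standard deviation does not, which is why a caption needs to say which one the bars show and how many runs were used.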

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the two major comments point by point below.

Point-by-point responses
  1. Referee: [Section 3 (Layer Contribution definition and experimental protocol)] The layer contribution metric (defined via isolated training of one layer with all others frozen) is load-bearing for the headline claim that 'one layer is enough.' The manuscript provides no experiments testing whether isolated gains are additive or whether non-additive cross-layer interactions (gradient interference, representation realignment, or compensatory plasticity) arise only under simultaneous full-parameter updates. Without such controls or an ablation showing that the sum of isolated contributions approximates full training, the metric may not accurately reflect each layer's role during standard RL training.

    Authors: The layer contribution metric is defined to measure the fraction of full RL improvement recovered by training one layer in isolation. This directly supports the empirical claim that a single middle layer suffices to recover most gains. While non-additive interactions may exist under joint updates, our results demonstrate that they are not required to obtain the reported performance; the isolated setting already matches or exceeds full training in many cases. We will add a clarifying paragraph in Section 3 noting that the metric quantifies isolated efficacy rather than providing an exact additive decomposition of full-parameter contributions. revision: partial

  2. Referee: [Section 4 (Results) and Appendix (training details)] Abstract and results sections report consistent patterns but give no details on statistical significance testing, variance across random seeds, or exact controls for training protocol differences (e.g., learning rate scaling, batch size, or optimizer state when only one layer is updated). These omissions make it difficult to assess whether the reported 'most of the gains' or 'surpass' cases are robust.

    Authors: We agree that explicit reporting of variance and protocol controls strengthens the results. The revised manuscript will expand the Appendix to include: (i) the number of random seeds (3–5) and standard-error bars on all figures, (ii) confirmation that single-layer experiments used identical learning rates, batch sizes, and optimizer states as the full-parameter baselines, and (iii) brief mention of statistical significance for the largest reported differences. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical definition and measurement of layer contribution

Full rationale

The paper reports direct experimental measurements of RL performance when training individual layers in isolation versus full-parameter updates. The layer contribution quantity is defined explicitly as the observed fraction of full RL gain recovered by each isolated run; this is a measurement, not a fitted parameter or derived prediction that reduces to its own inputs. No equations, ansatzes, uniqueness theorems, or self-citations are used to generate the reported results. The central claims rest on the experimental outcomes themselves across multiple models, algorithms, and tasks, with no load-bearing step that collapses by construction to a prior definition or fit within the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical study that introduces the layer contribution metric through isolated-training experiments. No mathematical derivation, fitted constants in equations, or new postulated entities appear in the abstract.

axioms (1)
  • domain assumption Transformer layers can be trained independently while freezing the remainder to isolate their contribution to RL gains
    Core premise required to define and compute the layer contribution quantity from isolated runs.
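In practice, the freezing assumed by this axiom amounts to choosing which parameters receive gradient updates. The sketch below is framework-free and illustrative only: it assumes parameter names follow a `model.layers.<index>.` prefix, which is a hypothetical naming scheme here and not something this page confirms about the paper's code.

```python
def trainable_mask(param_names, layer_index, prefix="model.layers."):
    """Map each parameter name to True only if it belongs to the chosen layer."""
    # The trailing dot keeps layer 1 from also matching layers 10 through 19.
    target = f"{prefix}{layer_index}."
    return {name: name.startswith(target) for name in param_names}


# Hypothetical parameter names, for illustration only.
names = [
    "model.embed_tokens.weight",
    "model.layers.1.mlp.up_proj.weight",
    "model.layers.14.self_attn.q_proj.weight",
    "lm_head.weight",
]
mask = trainable_mask(names, layer_index=1)
print([name for name, train in mask.items() if train])
```

In a framework such as PyTorch, a mask like this could be applied by setting each parameter's `requires_grad` flag accordingly before building the optimizer.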

pith-pipeline@v0.9.1-grok · 5800 in / 1200 out tokens · 42563 ms · 2026-07-02T14:45:50.878572+00:00 · methodology
