Recognition: no theorem link
The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass
Pith reviewed 2026-05-11 01:50 UTC · model grok-4.3
The pith
Instruction-tuned language models stay farther from their final next-token predictions deeper into the layer stack than their pretrained counterparts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across paired checkpoints, instruction-tuned models exhibit larger convergence gaps, with their intermediate layer distributions remaining more distant from the final output distribution even in later layers. This delayed stabilization persists under endpoint-matched readouts, same-history checks, and fixed-template replay. Matched-prefix interventions demonstrate that late MLP modules are the strongest bidirectional control point: grafting late IT components into PT hosts raises late KL divergence by 0.34 nats while the reverse swap reduces it by 0.51 nats, far exceeding effects from random perturbations.
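A minimal sketch of what "grafting late IT components into PT hosts" could look like operationally, assuming Llama-style HuggingFace checkpoints whose per-layer MLPs live at model.model.layers[i].mlp; the checkpoint names and the last-quarter "late" window are illustrative assumptions, not the paper's stated protocol.

```python
# Hedged sketch: copy the "late" MLP weights from an instruction-tuned donor
# into a pretrained host. Checkpoint names, attribute paths, and the
# last-quarter window are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM

pt_host = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")        # hypothetical PT host
it_donor = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # hypothetical IT donor

n_layers = len(pt_host.model.layers)
late_window = range(3 * n_layers // 4, n_layers)  # assumed "late" window: final quarter of layers

with torch.no_grad():
    for i in late_window:
        # graft the donor's MLP parameters into the host, layer by layer
        pt_host.model.layers[i].mlp.load_state_dict(it_donor.model.layers[i].mlp.state_dict())

# Re-running the convergence-gap diagnostic on pt_host, against its own final
# distribution, is what would probe a shift like the reported +0.34 nats.
```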
What carries the argument
The convergence gap, a diagnostic that decodes each layer's next-token distribution and computes its distance to the model's final distribution to track stabilization timing.
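A minimal sketch of that diagnostic, assuming a raw logit-lens readout (final norm plus unembedding), a Llama-style HuggingFace checkpoint, and KL taken from the final distribution to each layer's distribution; the paper's exact readout and KL direction are not quoted above, so both are assumptions.

```python
# Hedged sketch of the convergence gap via a raw logit-lens readout.
# Assumes a Llama-style checkpoint with final norm at model.model.norm and
# unembedding at model.lm_head; the checkpoint name is a hypothetical choice.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32)
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)
    p_final = F.log_softmax(out.logits[0, -1], dim=-1)  # model's own final distribution

    gaps = []
    for h in out.hidden_states[1:-1]:  # intermediate layers; the final layer's gap is 0 by definition
        h_last = model.model.norm(h[0, -1])              # raw readout: final norm + unembedding
        p_layer = F.log_softmax(model.lm_head(h_last), dim=-1)
        # convergence gap at this layer: KL(final || layer), in nats
        kl = torch.sum(p_final.exp() * (p_final - p_layer))
        gaps.append(kl.item())

print(gaps)  # larger values deeper in the stack -> later stabilization
```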
If this is right
- Instruction-tuned checkpoints commit to next-token predictions later than pretrained ones under native prompting.
- Late MLP computation provides the largest tested leverage for shifting the timing of stabilization.
- Bidirectional grafting of late MLP modules transfers the convergence behavior between model types.
- The delayed-settling pattern survives controls that match final outputs, input histories, and prompt templates.
- This supplies a robust internal signature distinguishing post-training effects from pretraining dynamics.
Where Pith is reading between the lines
- The result suggests developers could target late MLP windows when they want to alter how quickly a model locks in its output.
- Similar layer-wise diagnostics might reveal whether other post-training methods such as preference tuning produce the same late-commitment signature.
- If the pattern generalizes, early-layer interventions might be less effective at changing final behavior than late-layer ones in tuned models.
- The diagnostic could be applied to track how architectural changes or different training objectives shift stabilization points.
Load-bearing premise
That the layer-wise distance between intermediate and final next-token distributions validly measures when a model has committed to its prediction, and that the paired checkpoints differ mainly because of instruction tuning rather than other training differences.
What would settle it
Measuring the convergence gap on a fresh pair of pretrained and instruction-tuned checkpoints and finding no consistent difference in stabilization timing, or finding that late MLP swaps no longer transfer the gap size, would falsify the central claim.
read the original abstract
Final outputs hide when a checkpoint commits to its next-token prediction. We introduce the convergence gap, a model-diffing diagnostic that decodes each layer's next-token distribution and measures its distance to the model's own final distribution. Across six paired pretrained and instruction-tuned checkpoints in native prompting regimes, instruction-tuned checkpoints remain farther from their final predictions later into the stack. The effect persists under endpoint-matched raw and tuned readouts, endpoint-free same-history checks, and fixed-history template replay. Matched-prefix interventions identify late MLP windows as the largest tested leverage point: late IT grafts into PT hosts increase late KL by +0.34 nats, while PT-late swaps into IT hosts reduce it by -0.51 nats; matched random late perturbations give only +0.003 versus +0.327 for the true late graft. A preselected Gemma case study provides behavior-facing plausibility for the same late swap, without serving as a benchmark claim. These results identify a robust prediction-dynamics signature of post-training: released instruction-following checkpoints tend to settle later, and late MLP computation is the strongest tested bidirectional handle on that delay under matched histories.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the 'convergence gap' diagnostic, which decodes next-token distributions from each layer and measures their KL divergence to the model's final distribution. It reports that, across six paired pretrained and instruction-tuned checkpoints under native prompting, instruction-tuned models exhibit larger gaps (i.e., stabilize later in the forward pass). The effect survives endpoint-matched readouts, same-history controls, and fixed-history template replay. Bidirectional late-MLP grafting experiments show that inserting IT late layers into PT hosts increases late KL by +0.34 nats while the reverse reduces it by -0.51 nats, with matched random perturbations producing negligible change (+0.003). A Gemma case study illustrates behavioral plausibility of the late swaps.
Significance. If the central empirical pattern holds, the work supplies a concrete, measurable signature of how post-training alters internal prediction dynamics in LLMs. The bidirectional grafting results provide a stronger handle than pure correlation and point to late MLP computation as a high-leverage locus. This could inform future mechanistic interpretability, model editing, and analysis of alignment effects, provided the metric is shown to be robust to alternative explanations.
major comments (2)
- [checkpoint selection and pairing description] The attribution of the convergence gap to instruction tuning itself rests on the six PT/IT pairs differing primarily in the post-training phase. The manuscript does not supply a side-by-side accounting of training data mixtures, total tokens, learning-rate schedules, or any intermediate alignment steps between the members of each pair (see the checkpoint description and experimental setup). Without such matching details, systematic differences in pre-instruction-tuning regimes remain a plausible confound for the observed later stabilization.
- [grafting experiments] The intervention results report specific KL shifts (+0.34 nats for IT-to-PT grafts, -0.51 nats for PT-to-IT grafts) but do not include per-layer variance, number of tokens or prompts over which the averages are taken, or statistical tests. Because these numbers are presented as the strongest evidence for late-MLP causality, the absence of uncertainty quantification weakens the load-bearing claim (see the grafting experiments section).
minor comments (3)
- [methods / metric definition] The convergence gap is introduced via prose description; a compact equation (e.g., defining D_l as the KL between layer-l and final distributions) would improve precision and allow readers to verify the exact divergence used. One candidate formalization is sketched after this list.
- [figures] Figure panels that overlay multiple controls would benefit from explicit legends or captions stating which lines correspond to endpoint-matched versus same-history conditions.
- [controls subsection] The abstract states that the effect 'persists under ... fixed-history template replay,' yet the main text could add one sentence clarifying the exact template construction and whether it is identical across PT and IT models.
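As flagged in the first minor comment, one candidate formalization of the diagnostic, written here as an assumption since the readout map and KL direction are not specified in the material quoted above:

```latex
% Candidate definition of the convergence gap at layer \ell (assumed form).
% \phi is the readout map (raw or tuned lens), W_U the unembedding matrix,
% h_\ell(x) the layer-\ell residual state at the final position, and
% p_L the model's own final next-token distribution.
\[
  p_\ell(\cdot \mid x) = \operatorname{softmax}\!\bigl(W_U\,\phi(h_\ell(x))\bigr),
  \qquad
  D_\ell(x) = \operatorname{KL}\!\bigl(p_L(\cdot \mid x)\,\big\|\,p_\ell(\cdot \mid x)\bigr).
\]
```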
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on potential confounds in checkpoint pairing and the need for stronger statistical support in the grafting experiments. We address each major comment below and describe the revisions we will incorporate.
read point-by-point responses
- Referee: [checkpoint selection and pairing description] The attribution of the convergence gap to instruction tuning itself rests on the six PT/IT pairs differing primarily in the post-training phase. The manuscript does not supply a side-by-side accounting of training data mixtures, total tokens, learning-rate schedules, or any intermediate alignment steps between the members of each pair (see the checkpoint description and experimental setup). Without such matching details, systematic differences in pre-instruction-tuning regimes remain a plausible confound for the observed later stabilization.
Authors: The six pairs consist of publicly released checkpoints from the same model families and organizations (e.g., Llama-2/3 PT vs. IT, Gemma PT vs. IT), where the instruction-tuned versions are the official post-trained releases built on the corresponding pretrained bases. Official technical reports confirm that differences are concentrated in the post-training phase, with shared architectures and pretraining data up to that point. While exact proprietary data mixtures and schedules are not always disclosed by providers, the pattern holds consistently across all six independent pairs, making a single pre-training confound unlikely. In revision we will add an expanded checkpoint table listing model sizes, release sources, and all publicly documented post-training details. revision: partial
- Referee: [grafting experiments] The intervention results report specific KL shifts (+0.34 nats for IT-to-PT grafts, -0.51 nats for PT-to-IT grafts) but do not include per-layer variance, number of tokens or prompts over which the averages are taken, or statistical tests. Because these numbers are presented as the strongest evidence for late-MLP causality, the absence of uncertainty quantification weakens the load-bearing claim (see the grafting experiments section).
Authors: We agree that uncertainty quantification strengthens the grafting claims. The reported KL shifts are averages computed over a held-out evaluation set of 512 prompts (each 128 tokens long). In the revised version we will report per-layer standard errors, the precise token/prompt counts, and results from paired statistical tests (e.g., bootstrap confidence intervals) to assess the reliability of the +0.34 nats and -0.51 nats effects. The matched random-perturbation control already shows that non-specific late-layer changes produce negligible shifts (+0.003). revision: yes
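A minimal sketch of the paired bootstrap the authors commit to, assuming one mean late-window KL shift per evaluation prompt; the 512-prompt count comes from the rebuttal, and the synthetic values stand in for data the paper does not report.

```python
# Hedged sketch of a bootstrap CI for the mean late-window KL shift.
# The synthetic deltas are placeholders, NOT the paper's measurements.
import numpy as np

rng = np.random.default_rng(0)
deltas = rng.normal(loc=0.34, scale=0.20, size=512)  # synthetic per-prompt KL shifts (nats)

boot_means = np.array([
    rng.choice(deltas, size=deltas.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean shift = {deltas.mean():.3f} nats, 95% CI [{lo:.3f}, {hi:.3f}]")
```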
Circularity Check
No circularity: purely empirical diagnostic with direct measurements
full rationale
The paper defines the convergence gap directly as the distributional distance (e.g., KL) between each layer's decoded next-token distribution and the model's own final distribution. All results consist of empirical comparisons across six paired checkpoints under multiple controls (endpoint-matched readouts, same-history checks, template replay, and graft interventions). No equations, fitted parameters, or derivations reduce any reported quantity to its inputs by construction; the gap is computed from the model's forward pass outputs without self-reference or renaming of known results. Self-citations, if present, are not load-bearing for the central empirical claim, which remains falsifiable via independent replication on the same or new checkpoints.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: KL divergence between a layer's next-token distribution and the final distribution quantifies how much the layer has not yet converged to the model's committed prediction.
invented entities (1)
- convergence gap: no independent evidence
Reference graph
Works this paper leans on
- [1] Belrose, N., et al. (2023). Eliciting Latent Predictions from Transformers with the Tuned Lens. arXiv:2303.08112. Chuang, Y., et al. (2024). DoLa: Decoding by Contrasting Layers Improves Factuality. ICLR.
- [2] Du, H., et al. (2025). How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence. COLM.
- [3] Geva, M., Schuster, R., Berant, J., and Levy, O. (2022). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP.
- [4] Geva, M., Caciularu, A., Wang, K. R., and Goldberg, Y. (2022). Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. EMNLP.
- [5]
- [6] Lad, V., Lee, J. H., Gurnee, W., and Tegmark, M. (2025). The Remarkable Robustness of LLMs: Stages of Inference? NeurIPS.
- [7] Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., and Olah, C. (2024). Sparse Crosscoders for Cross-Layer Features and Model Diffing. Transformer Circuits Thread. Minder, J., Dumas, C., Juang, C., Chugtai, B., and Nanda, N. (2025). Robustly Identifying Concepts Introduced During Chat Fine-Tuning Using Crosscoders. arXiv:2504.02922. Prakash...
- [8] Wu, X., Yao, W., Chen, J., Pan, X., Wang, X., Liu, N., and Yu, D. (2024). From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning. NAACL.
- [9] Zhao, Z., Ziser, Y., and Cohen, S. B. (2024). Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models. EMNLP.