Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
Pith reviewed 2026-05-11 01:41 UTC · model grok-4.3
The pith
Instruction tuning makes the late stack produce larger next-token margins only when it reads its own post-trained upstream state rather than the base model's.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream, giving a +1.68 interaction that is positive in every family. Thus the late stack has a real PT-upstream effect, but its larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state. Sparse features in final MLP layers partially mediate the effect and are driven by upstream patches.
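Spelled out as the 2×2 factorial behind those numbers (the margin notation m(up, late) is ours, not the paper's), the interaction is simply the difference between the two late-stack effects:

```latex
\begin{aligned}
\Delta_{\text{late}}(\text{PT}_{\text{up}})
  &= m(\text{PT}_{\text{up}},\,\text{IT}_{\text{late}})
   - m(\text{PT}_{\text{up}},\,\text{PT}_{\text{late}}) = +0.76 \\
\Delta_{\text{late}}(\text{IT}_{\text{up}})
  &= m(\text{IT}_{\text{up}},\,\text{IT}_{\text{late}})
   - m(\text{IT}_{\text{up}},\,\text{PT}_{\text{late}}) = +2.44 \\
\text{interaction}
  &= \Delta_{\text{late}}(\text{IT}_{\text{up}})
   - \Delta_{\text{late}}(\text{PT}_{\text{up}}) = 2.44 - 0.76 = +1.68
\end{aligned}
```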
What carries the argument
first-divergence cross-patching: at the single first token where the PT and IT checkpoints diverge on the next-token prediction, each model's early-layer activations are swapped into the other model's late stack, measuring how upstream state conditions late readout
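A minimal sketch of the four-cell procedure, assuming two architecture-matched HuggingFace-style causal LMs with a Llama-like `model.model.layers` stack; the hook-based patching, the `split_layer` index, and the logit-margin definition are illustrative choices, not the paper's exact implementation:

```python
import torch

@torch.no_grad()
def upstream_state(model, input_ids, split_layer):
    """Donor residual-stream state entering block `split_layer`
    (hidden_states[0] is the embedding output, so index = block index)."""
    out = model(input_ids, output_hidden_states=True)
    return out.hidden_states[split_layer]

@torch.no_grad()
def patched_margin(model, input_ids, donor_hidden, split_layer, it_tok, pt_tok):
    """Run `model` with the input of block `split_layer` overwritten by the
    donor's upstream state; return the IT-minus-PT logit margin at the end."""
    def pre_hook(module, args):
        return (donor_hidden,) + args[1:]  # swap the block's hidden-state input
    handle = model.model.layers[split_layer].register_forward_pre_hook(pre_hook)
    try:
        logits = model(input_ids).logits[0, -1]
    finally:
        handle.remove()
    return (logits[it_tok] - logits[pt_tok]).item()

@torch.no_grad()
def interaction_at_divergence(pt, it, input_ids, split_layer):
    """2x2 factorial at one prefix ending at a first-divergence token."""
    pt_tok = pt(input_ids).logits[0, -1].argmax().item()
    it_tok = it(input_ids).logits[0, -1].argmax().item()
    assert pt_tok != it_tok, "prefix must end where the checkpoints disagree"
    cells = {}
    for up, donor in (("PT", pt), ("IT", it)):
        h = upstream_state(donor, input_ids, split_layer)
        for late, receiver in (("PT", pt), ("IT", it)):
            cells[(up, late)] = patched_margin(
                receiver, input_ids, h, split_layer, it_tok, pt_tok)
    # late-stack effect under each upstream state, and their difference
    return (cells[("IT", "IT")] - cells[("IT", "PT")]) - \
           (cells[("PT", "IT")] - cells[("PT", "PT")])
```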
If this is right
- Late-layer localizations of post-trained behavior must be re-tested under the other checkpoint's upstream state before being treated as self-contained.
- Sparse features in the final MLP layers are driven by upstream patches and partially mediate the handoff to next-token margins.
- Forced-token scoring shows that the local token choice at the divergence point can change later exact-answer success (a sketch follows this list).
- Domain-specific SFT such as math tuning produces late effects that are more portable from base upstream state than general instruction tuning.
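On the forced-token bullet above, a minimal sketch under greedy continuation; `extract_answer` is a hypothetical task-specific parser, not something the paper specifies:

```python
import torch

def extract_answer(text):
    """Hypothetical parser: last whitespace-separated token. A real task
    (e.g., math word problems) would use its own answer extractor."""
    parts = text.strip().split()
    return parts[-1] if parts else ""

@torch.no_grad()
def forced_token_success(model, tokenizer, prompt_ids, forced_token, gold,
                         max_new_tokens=256):
    """Force one token at the divergence point, continue greedily, and
    score exact-answer success for that continuation."""
    forced = torch.tensor([[forced_token]], device=prompt_ids.device)
    ids = torch.cat([prompt_ids, forced], dim=-1)
    out = model.generate(ids, do_sample=False, max_new_tokens=max_new_tokens)
    completion = tokenizer.decode(out[0, ids.shape[-1]:],
                                  skip_special_tokens=True)
    return extract_answer(completion) == gold
```

Running this once with the PT checkpoint's divergence token and once with the IT checkpoint's, through the same continuation model, is what links a single local token choice to downstream exact-answer success.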
Where Pith is reading between the lines
- Interpretability claims that localize refusal or persona to late layers may be incomplete without testing whether those directions survive under base upstream state.
- The primary change from instruction tuning could reside in how early layers encode context that late layers then read out.
- The diagnostic could be applied to other post-training regimes such as preference optimization to check whether the same upstream-late interaction appears.
Load-bearing premise
That patching activations at the single first-divergence token isolates the upstream-to-late conditioning effect without introducing artifacts from token-level differences or earlier-layer divergences.
What would settle it
Finding that the interaction term is zero or negative when the same cross-patching is repeated on additional model families or when the first-divergence token is forced to be identical across checkpoints.
Original abstract
Recent interpretability work has identified model-internal handles on post-trained behavior, including refusal directions, assistant/persona axes, and sparse chat-tuning features. These results localize where behaviors can be read out or controlled, often in middle-to-late layers. We ask how earlier computation and the late stack cooperate to turn those differences into next-token margins. To test this, we introduce first-divergence cross-patching: at the first token where pretrained base (PT) and instruction-tuned (IT) checkpoints disagree, we cross each model's earlier-layer state with each model's late stack. The diagnostic separates training recipes: same-base instruction-following descendants show late effects that depend on their own earlier-layer state, while OpenMath2 math-domain SFT and controlled code/biomed CPT controls with verified domain learning do not; for OpenMath2, the late effect is already largely portable from base earlier-layer state. Across five dense families (4B-32B), the IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream, giving a +1.68 interaction that is positive in every family. Thus the late stack has a real PT-upstream effect, but its larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state. Sparse features in final MLP layers partially mediate the effect and are driven by upstream patches, supporting a handoff from earlier state to final-layer feature activation to IT-token margin. Forced-token scoring shows that the local token choice can change later exact-answer success. Operationally, paired-checkpoint studies that localize a difference to late layers should test whether it survives under the other checkpoint's upstream state before treating the late stack as self-contained.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces first-divergence cross-patching to test how instruction tuning alters the conditioning of late-layer readouts on upstream states. At the first token where PT and IT checkpoints disagree, early activations are crossed with each model's late stack. Across five dense families (4B-32B), the IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream (interaction +1.68, positive in every family). Domain-specific SFT controls (e.g., OpenMath2) show more portable late effects from base upstream states. Sparse final-MLP features partially mediate the handoff, and forced-token scoring links local choice to later exact-answer success. The conclusion is that paired-checkpoint studies localizing differences to late layers must verify survival under the alternate upstream state.
Significance. If the cross-patching isolates the intended effect, the result is significant for mechanistic interpretability: it shows post-training changes the upstream-to-late interface rather than only late behaviors, with direct implications for refusal, persona, and feature-localization studies. Strengths include replication across five families, explicit controls for domain learning, mediation analysis via sparse features, and the operational recommendation for future paired-checkpoint work. The finding that local token margins affect downstream exact-answer success adds a practical dimension.
major comments (2)
- [§3] §3 (first-divergence cross-patching definition): the procedure selects the first disagreement token but provides no ablations or controls for earlier-layer hidden-state divergences, token-identity effects, or scale-matched patching across the 4B-32B families. Because the central +1.68 interaction and the claim that 'the larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state' rest on clean isolation of upstream state, these omissions are load-bearing.
- [§4] §4 (empirical results): the reported logit values (+0.76 PT-upstream, +2.44 IT-upstream, +1.68 interaction) are given without error bars, per-family variances, or statistical tests, even though the abstract asserts the interaction is positive in every family. This weakens confidence that the numerical pattern is robust rather than sensitive to token selection or model-specific noise.
minor comments (2)
- The notation for 'PT-upstream effect' and 'IT-upstream effect' in the abstract and results could be clarified with an explicit equation or diagram showing the four patching combinations.
- Implementation details for state extraction, patching, and logit readout (e.g., exact layer indices, activation hooks) are referenced but not fully specified; a short appendix or pseudocode would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments identify important areas for strengthening the isolation of effects in the first-divergence cross-patching procedure and for improving statistical reporting. We have revised the manuscript to incorporate additional controls and statistical details where feasible, while clarifying the scope of our claims. Point-by-point responses follow.
Point-by-point responses
Referee: [§3] §3 (first-divergence cross-patching definition): the procedure selects the first disagreement token but provides no ablations or controls for earlier-layer hidden-state divergences, token-identity effects, or scale-matched patching across the 4B-32B families. Because the central +1.68 interaction and the claim that 'the larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state' rest on clean isolation of upstream state, these omissions are load-bearing.
Authors: We agree that stronger controls would better isolate the upstream-state contribution. In the revised manuscript we add two targeted ablations: (1) a token-identity control that restricts the analysis to first-divergence tokens whose surface forms are identical between PT and IT (removing any lexical cue), and (2) an earlier-layer divergence diagnostic that reports the layer-by-layer L2 norm of hidden-state differences up to the first divergence point, showing that pre-divergence gaps are small and comparable across the PT/IT pairs. For scale matching we now normalize the patched activations by the per-model activation variance before readout. These additions are reported in a new subsection of §3 and in an expanded appendix. We retain the original first-divergence selection because it is the earliest point at which the two checkpoints can be cleanly contrasted; exhaustive ablations of every possible earlier divergence remain computationally prohibitive for the 32B models, but the new controls address the most direct threats to the isolation claim. The interaction remains positive and of similar magnitude under the restricted token set, supporting the original interpretation while making the assumptions explicit. revision: partial
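A minimal sketch of the earlier-layer divergence diagnostic described in (2), assuming per-layer hidden states are available; the relative-norm normalization here is a simplification of the per-model activation-variance normalization the rebuttal describes:

```python
import torch

@torch.no_grad()
def predivergence_gaps(pt, it, input_ids):
    """Relative L2 gap between PT and IT hidden states at each layer, on the
    shared prefix ending at the first-divergence position."""
    h_pt = pt(input_ids, output_hidden_states=True).hidden_states
    h_it = it(input_ids, output_hidden_states=True).hidden_states
    gaps = []
    for a, b in zip(h_pt, h_it):   # embedding output, then one entry per block
        diff = a[0, -1] - b[0, -1]           # state at the divergence position
        gaps.append((diff.norm() / a[0, -1].norm()).item())
    return gaps  # small, flat values before the split support the isolation claim
```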
Referee: [§4] §4 (empirical results): the reported logit values (+0.76 PT-upstream, +2.44 IT-upstream, +1.68 interaction) are given without error bars, per-family variances, or statistical tests, even though the abstract asserts the interaction is positive in every family. This weakens confidence that the numerical pattern is robust rather than sensitive to token selection or model-specific noise.
Authors: We accept that the original presentation lacked quantitative uncertainty measures. The revised §4 now includes (i) per-family logit tables with standard errors computed over the first-divergence tokens, (ii) 95% confidence intervals for the aggregate +0.76 / +2.44 / +1.68 figures, and (iii) a sign test and a Wilcoxon signed-rank test across the five families confirming that the interaction is positive in every family (p < 0.01). We also report the range of per-family interactions (minimum +1.12) to demonstrate that no single family drives the result. These additions directly address sensitivity to token selection and model-specific noise. revision: yes
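A minimal sketch of the family-level tests named in (iii); the five input values in the usage line are hypothetical placeholders (the source reports only the minimum, +1.12, and the +1.68 aggregate), not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

def family_level_tests(per_family_interactions):
    """Sign pattern plus a one-sided Wilcoxon signed-rank test against a
    zero median, over per-family mean interaction terms."""
    x = np.asarray(per_family_interactions, dtype=float)
    stat, p = wilcoxon(x, alternative="greater")  # H0: median interaction <= 0
    return {"all_positive": bool((x > 0).all()),
            "min_interaction": float(x.min()),
            "wilcoxon_stat": float(stat),
            "wilcoxon_p": float(p)}

# Hypothetical per-family values, chosen to match the reported minimum:
print(family_level_tests([1.12, 1.35, 1.60, 1.95, 2.38]))
```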
Circularity Check
No circularity: results from direct cross-patching measurements
Full rationale
The paper presents an empirical diagnostic based on first-divergence cross-patching experiments across PT and IT checkpoints in multiple model families. It reports measured logit contributions (+0.76 from PT upstream, +2.44 from IT upstream) and an interaction term without any derivation chain, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on observable activation handoffs and feature mediation in final MLP layers, which are externally falsifiable via the described patching protocol and do not reduce to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Patching activations at the first token of disagreement between PT and IT checkpoints isolates the effect of upstream state on late readout without confounding artifacts from earlier layers or token mismatches.