Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
Pith reviewed 2026-05-11 01:41 UTC · model grok-4.3
The pith
Instruction tuning makes the late stack produce larger next-token margins only when it reads its own post-trained upstream state rather than the base model's.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream, giving a +1.68 interaction that is positive in every family. Thus the late stack has a real PT-upstream effect, but its larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state. Sparse features in final MLP layers partially mediate the effect and are driven by upstream patches.
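Spelled out as the 2×2 factorial behind those numbers (the margin notation m(up, late) is ours, not the paper's), the interaction is simply the difference between the two late-stack effects:

```latex
\begin{aligned}
\Delta_{\text{late}}(\text{PT}_{\text{up}})
  &= m(\text{PT}_{\text{up}},\,\text{IT}_{\text{late}})
   - m(\text{PT}_{\text{up}},\,\text{PT}_{\text{late}}) = +0.76 \\
\Delta_{\text{late}}(\text{IT}_{\text{up}})
  &= m(\text{IT}_{\text{up}},\,\text{IT}_{\text{late}})
   - m(\text{IT}_{\text{up}},\,\text{PT}_{\text{late}}) = +2.44 \\
\text{interaction}
  &= \Delta_{\text{late}}(\text{IT}_{\text{up}})
   - \Delta_{\text{late}}(\text{PT}_{\text{up}}) = 2.44 - 0.76 = +1.68
\end{aligned}
```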
What carries the argument
first-divergence cross-patching: at the single first token where the PT and IT checkpoints diverge on the next-token prediction, each model's early-layer activations are swapped into the other model's late stack, measuring how upstream state conditions late readout
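A minimal sketch of the four-cell procedure, assuming two architecture-matched HuggingFace-style causal LMs with a Llama-like `model.model.layers` stack; the hook-based patching, the `split_layer` index, and the logit-margin definition are illustrative choices, not the paper's exact implementation:

```python
import torch

@torch.no_grad()
def upstream_state(model, input_ids, split_layer):
    """Donor residual-stream state entering block `split_layer`
    (hidden_states[0] is the embedding output, so index = block index)."""
    out = model(input_ids, output_hidden_states=True)
    return out.hidden_states[split_layer]

@torch.no_grad()
def patched_margin(model, input_ids, donor_hidden, split_layer, it_tok, pt_tok):
    """Run `model` with the input of block `split_layer` overwritten by the
    donor's upstream state; return the IT-minus-PT logit margin at the end."""
    def pre_hook(module, args):
        return (donor_hidden,) + args[1:]  # swap the block's hidden-state input
    handle = model.model.layers[split_layer].register_forward_pre_hook(pre_hook)
    try:
        logits = model(input_ids).logits[0, -1]
    finally:
        handle.remove()
    return (logits[it_tok] - logits[pt_tok]).item()

@torch.no_grad()
def interaction_at_divergence(pt, it, input_ids, split_layer):
    """2x2 factorial at one prefix ending at a first-divergence token."""
    pt_tok = pt(input_ids).logits[0, -1].argmax().item()
    it_tok = it(input_ids).logits[0, -1].argmax().item()
    assert pt_tok != it_tok, "prefix must end where the checkpoints disagree"
    cells = {}
    for up, donor in (("PT", pt), ("IT", it)):
        h = upstream_state(donor, input_ids, split_layer)
        for late, receiver in (("PT", pt), ("IT", it)):
            cells[(up, late)] = patched_margin(
                receiver, input_ids, h, split_layer, it_tok, pt_tok)
    # late-stack effect under each upstream state, and their difference
    return (cells[("IT", "IT")] - cells[("IT", "PT")]) - \
           (cells[("PT", "IT")] - cells[("PT", "PT")])
```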
If this is right
- Late-layer localizations of post-trained behavior must be re-tested under the other checkpoint's upstream state before being treated as self-contained.
- Sparse features in the final MLP layers are driven by upstream patches and partially mediate the handoff to next-token margins.
- Forced-token scoring shows that the local token choice at the divergence point can change later exact-answer success (a sketch follows this list).
- Domain-specific SFT such as math tuning produces late effects that are more portable from base upstream state than general instruction tuning.
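On the forced-token bullet above, a minimal sketch under greedy continuation; `extract_answer` is a hypothetical task-specific parser, not something the paper specifies:

```python
import torch

def extract_answer(text):
    """Hypothetical parser: last whitespace-separated token. A real task
    (e.g., math word problems) would use its own answer extractor."""
    parts = text.strip().split()
    return parts[-1] if parts else ""

@torch.no_grad()
def forced_token_success(model, tokenizer, prompt_ids, forced_token, gold,
                         max_new_tokens=256):
    """Force one token at the divergence point, continue greedily, and
    score exact-answer success for that continuation."""
    forced = torch.tensor([[forced_token]], device=prompt_ids.device)
    ids = torch.cat([prompt_ids, forced], dim=-1)
    out = model.generate(ids, do_sample=False, max_new_tokens=max_new_tokens)
    completion = tokenizer.decode(out[0, ids.shape[-1]:],
                                  skip_special_tokens=True)
    return extract_answer(completion) == gold
```

Running this once with the PT checkpoint's divergence token and once with the IT checkpoint's, through the same continuation model, is what links a single local token choice to downstream exact-answer success.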
Where Pith is reading between the lines
- Interpretability claims that localize refusal or persona to late layers may be incomplete without testing whether those directions survive under base upstream state.
- The primary change from instruction tuning could reside in how early layers encode context that late layers then read out.
- The diagnostic could be applied to other post-training regimes such as preference optimization to check whether the same upstream-late interaction appears.
Load-bearing premise
That patching activations at the single first-divergence token isolates the upstream-to-late conditioning effect without introducing artifacts from token-level differences or earlier-layer divergences.
What would settle it
Finding that the interaction term is zero or negative when the same cross-patching is repeated on additional model families or when the first-divergence token is forced to be identical across checkpoints.
Original abstract
Recent interpretability work has identified model-internal handles on post-trained behavior, including refusal directions, assistant/persona axes, and sparse chat-tuning features. These results localize where behaviors can be read out or controlled, often in middle-to-late layers. We ask how earlier computation and the late stack cooperate to turn those differences into next-token margins. To test this, we introduce first-divergence cross-patching: at the first token where pretrained base (PT) and instruction-tuned (IT) checkpoints disagree, we cross each model's earlier-layer state with each model's late stack. The diagnostic separates training recipes: same-base instruction-following descendants show late effects that depend on their own earlier-layer state, while OpenMath2 math-domain SFT and controlled code/biomed CPT controls with verified domain learning do not; for OpenMath2, the late effect is already largely portable from base earlier-layer state. Across five dense families (4B-32B), the IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream, giving a +1.68 interaction that is positive in every family. Thus the late stack has a real PT-upstream effect, but its larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state. Sparse features in final MLP layers partially mediate the effect and are driven by upstream patches, supporting a handoff from earlier state to final-layer feature activation to IT-token margin. Forced-token scoring shows that the local token choice can change later exact-answer success. Operationally, paired-checkpoint studies that localize a difference to late layers should test whether it survives under the other checkpoint's upstream state before treating the late stack as self-contained.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces first-divergence cross-patching to test how instruction tuning alters the conditioning of late-layer readouts on upstream states. At the first token where PT and IT checkpoints disagree, early activations are crossed with each model's late stack. Across five dense families (4B-32B), the IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream (interaction +1.68, positive in every family). Domain-specific SFT controls (e.g., OpenMath2) show more portable late effects from base upstream states. Sparse final-MLP features partially mediate the handoff, and forced-token scoring links local choice to later exact-answer success. The conclusion is that paired-checkpoint studies localizing differences to late layers must verify survival under the alternate upstream state.
Significance. If the cross-patching isolates the intended effect, the result is significant for mechanistic interpretability: it shows post-training changes the upstream-to-late interface rather than only late behaviors, with direct implications for refusal, persona, and feature-localization studies. Strengths include replication across five families, explicit controls for domain learning, mediation analysis via sparse features, and the operational recommendation for future paired-checkpoint work. The finding that local token margins affect downstream exact-answer success adds a practical dimension.
major comments (2)
- [§3] §3 (first-divergence cross-patching definition): the procedure selects the first disagreement token but provides no ablations or controls for earlier-layer hidden-state divergences, token-identity effects, or scale-matched patching across the 4B-32B families. Because the central +1.68 interaction and the claim that 'the larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state' rest on clean isolation of upstream state, these omissions are load-bearing.
- [§4] §4 (empirical results): the reported logit values (+0.76 PT-upstream, +2.44 IT-upstream, +1.68 interaction) are given without error bars, per-family variances, or statistical tests, even though the abstract asserts the interaction is positive in every family. This weakens confidence that the numerical pattern is robust rather than sensitive to token selection or model-specific noise.
minor comments (2)
- The notation for 'PT-upstream effect' and 'IT-upstream effect' in the abstract and results could be clarified with an explicit equation or diagram showing the four patching combinations.
- Implementation details for state extraction, patching, and logit readout (e.g., exact layer indices, activation hooks) are referenced but not fully specified; a short appendix or pseudocode would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments identify important areas for strengthening the isolation of effects in the first-divergence cross-patching procedure and for improving statistical reporting. We have revised the manuscript to incorporate additional controls and statistical details where feasible, while clarifying the scope of our claims. Point-by-point responses follow.
Point-by-point responses
Referee: [§3] §3 (first-divergence cross-patching definition): the procedure selects the first disagreement token but provides no ablations or controls for earlier-layer hidden-state divergences, token-identity effects, or scale-matched patching across the 4B-32B families. Because the central +1.68 interaction and the claim that 'the larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state' rest on clean isolation of upstream state, these omissions are load-bearing.
Authors: We agree that stronger controls would better isolate the upstream-state contribution. In the revised manuscript we add two targeted ablations: (1) a token-identity control that restricts the analysis to first-divergence tokens whose surface forms are identical between PT and IT (removing any lexical cue), and (2) an earlier-layer divergence diagnostic that reports the layer-by-layer L2 norm of hidden-state differences up to the first divergence point, showing that pre-divergence gaps are small and comparable across the PT/IT pairs. For scale matching we now normalize the patched activations by the per-model activation variance before readout. These additions are reported in a new subsection of §3 and in an expanded appendix. We retain the original first-divergence selection because it is the earliest point at which the two checkpoints can be cleanly contrasted; exhaustive ablations of every possible earlier divergence remain computationally prohibitive for the 32B models, but the new controls address the most direct threats to the isolation claim. The interaction remains positive and of similar magnitude under the restricted token set, supporting the original interpretation while making the assumptions explicit. revision: partial
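A minimal sketch of the earlier-layer divergence diagnostic described in (2), assuming per-layer hidden states are available; the relative-norm normalization here is a simplification of the per-model activation-variance normalization the rebuttal describes:

```python
import torch

@torch.no_grad()
def predivergence_gaps(pt, it, input_ids):
    """Relative L2 gap between PT and IT hidden states at each layer, on the
    shared prefix ending at the first-divergence position."""
    h_pt = pt(input_ids, output_hidden_states=True).hidden_states
    h_it = it(input_ids, output_hidden_states=True).hidden_states
    gaps = []
    for a, b in zip(h_pt, h_it):   # embedding output, then one entry per block
        diff = a[0, -1] - b[0, -1]           # state at the divergence position
        gaps.append((diff.norm() / a[0, -1].norm()).item())
    return gaps  # small, flat values before the split support the isolation claim
```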
Referee: [§4] §4 (empirical results): the reported logit values (+0.76 PT-upstream, +2.44 IT-upstream, +1.68 interaction) are given without error bars, per-family variances, or statistical tests, even though the abstract asserts the interaction is positive in every family. This weakens confidence that the numerical pattern is robust rather than sensitive to token selection or model-specific noise.
Authors: We accept that the original presentation lacked quantitative uncertainty measures. The revised §4 now includes (i) per-family logit tables with standard errors computed over the first-divergence tokens, (ii) 95% confidence intervals for the aggregate +0.76 / +2.44 / +1.68 figures, and (iii) a sign test and a Wilcoxon signed-rank test across the five families confirming that the interaction is positive in every family (p < 0.01). We also report the range of per-family interactions (minimum +1.12) to demonstrate that no single family drives the result. These additions directly address sensitivity to token selection and model-specific noise. revision: yes
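A minimal sketch of the family-level tests named in (iii); the five input values in the usage line are hypothetical placeholders (the source reports only the minimum, +1.12, and the +1.68 aggregate), not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

def family_level_tests(per_family_interactions):
    """Sign pattern plus a one-sided Wilcoxon signed-rank test against a
    zero median, over per-family mean interaction terms."""
    x = np.asarray(per_family_interactions, dtype=float)
    stat, p = wilcoxon(x, alternative="greater")  # H0: median interaction <= 0
    return {"all_positive": bool((x > 0).all()),
            "min_interaction": float(x.min()),
            "wilcoxon_stat": float(stat),
            "wilcoxon_p": float(p)}

# Hypothetical per-family values, chosen to match the reported minimum:
print(family_level_tests([1.12, 1.35, 1.60, 1.95, 2.38]))
```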
Circularity Check
No circularity: results from direct cross-patching measurements
Full rationale
The paper presents an empirical diagnostic based on first-divergence cross-patching experiments across PT and IT checkpoints in multiple model families. It reports measured logit contributions (+0.76 from PT upstream, +2.44 from IT upstream) and an interaction term without any derivation chain, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on observable activation handoffs and feature mediation in final MLP layers, which are externally falsifiable via the described patching protocol and do not reduce to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Patching activations at the first token of disagreement between PT and IT checkpoints isolates the effect of upstream state on late readout without confounding artifacts from earlier layers or token mismatches.