arxiv: 2605.16600 · v1 · pith:CZWASWXRnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI · cs.CL

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

Pith reviewed 2026-05-20 20:28 UTC · model grok-4.3

keywords: transformer weight space · pretraining alignment asymmetry · gradient accumulation · read write pathways · residual stream · attention mechanisms · prediction subspace · outer product updates

The pith

Pretraining imprints prediction geometry on the write pathways of transformer weights, while alignment concentrates changes in the read pathways.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that cross-entropy pretraining and preference alignment update the same transformer weights but produce geometrically distinct patterns. Alignment weight changes concentrate in the read components of attention (the query and key projections), along the main directions of input activations, whereas changes to the write components stay close to evenly spread relative to the model's prediction directions. This pattern arises because each update is a sum of outer products of upstream gradients and input activations, so it inherits anisotropy from whichever factor has the more concentrated covariance after pretraining. A sympathetic reader would care because this geometry could clarify why alignment requires far less data than pretraining and why it tends to preserve rather than overwrite core capabilities.

Core claim

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. Alignment deltas concentrate in the read pathway (W_Q, W_K), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway (W_O, W_2) relative to the prediction subspace defined by the unembedding. The explanation is anisotropic gradient accumulation: updates to a matrix W are sums of outer products δ_t a_t^⊤, inheriting directional structure from the side with concentrated covariance. For read-pathway matrices the input activation a_t has spiked covariance from pretraining, producing objective-agnostic concentration.
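The accumulation argument can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's experiment: the factors δ_t and a_t are drawn as independent Gaussians with diagonal covariance, the "spike" is a hypothetical 10x standard deviation on the first k coordinates, and the function names are invented for illustration.

```python
import numpy as np

def accumulate_update(rng, steps, delta_scales, act_scales):
    """Return sum_t delta_t a_t^T for zero-mean Gaussian factors.

    delta_scales (d_out,) and act_scales (d_in,) are per-coordinate standard
    deviations, so each covariance is diagonal in the coordinate basis.
    This is a toy assumption, not a model of real gradients.
    """
    deltas = rng.standard_normal((steps, delta_scales.size)) * delta_scales
    acts = rng.standard_normal((steps, act_scales.size)) * act_scales
    return deltas.T @ acts  # shape (d_out, d_in)

def input_side_fraction(delta_w, k):
    """Share of squared Frobenius norm in the first k input coordinates (columns)."""
    return float(np.sum(delta_w[:, :k] ** 2) / np.sum(delta_w ** 2))

def output_side_fraction(delta_w, k):
    """Share of squared Frobenius norm in the first k output coordinates (rows)."""
    return float(np.sum(delta_w[:k, :] ** 2) / np.sum(delta_w ** 2))

rng = np.random.default_rng(0)
d, k, steps = 64, 4, 2000
flat = np.ones(d)
spiked = np.ones(d)
spiked[:k] = 10.0  # hypothetical spike strength

# Read-like case: structure sits on the activation side.
read_like = accumulate_update(rng, steps, delta_scales=flat, act_scales=spiked)
# Write-like case: structure sits on the upstream-gradient side.
write_like = accumulate_update(rng, steps, delta_scales=spiked, act_scales=flat)

print("read-like  input/output:",
      input_side_fraction(read_like, k), output_side_fraction(read_like, k))
print("write-like input/output:",
      input_side_fraction(write_like, k), output_side_fraction(write_like, k))
```

With these settings the ratio of expected energies on the spiked side is 400/460, about 0.87, against k/d, about 0.06, on the flat side. The accumulated delta is concentrated only along the factor whose covariance is spiked, whatever the other factor does.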

What carries the argument

Two pieces carry the argument: the relative-subspace-fraction probe, which measures how weight deltas align with residual-stream activation subspaces and with the prediction subspace, and the outer-product structure of gradient updates.
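The page does not reproduce the probe's formula, so the following is only one plausible reading, offered as a sketch. It assumes the probe is the share of a delta's squared Frobenius norm that falls inside a top-k activation subspace, divided by the share k/d that an isotropic delta would have in expectation. The names `top_k_basis` and `relative_subspace_fraction` are hypothetical.

```python
import numpy as np

def top_k_basis(acts, k):
    """Orthonormal basis (d, k) for the top-k principal directions of acts (n, d)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def relative_subspace_fraction(delta_w, basis):
    """Energy share of delta_w (d_out, d) inside span(basis), relative to k/d.

    A value near 1 means the delta is about as concentrated as an isotropic
    one; values above 1 mean concentration in the subspace. This
    normalisation is an assumption, not the paper's stated definition.
    """
    d, k = basis.shape
    share = np.sum((delta_w @ basis) ** 2) / np.sum(delta_w ** 2)
    return float(share / (k / d))

rng = np.random.default_rng(0)
d, k, n = 32, 4, 500
scales = np.ones(d)
scales[:k] = 10.0  # toy spiked activation covariance
acts = rng.standard_normal((n, d)) * scales
basis = top_k_basis(acts, k)

isotropic_delta = rng.standard_normal((128, d))
aligned_delta = np.outer(rng.standard_normal(128), basis[:, 0])

print("isotropic delta:", relative_subspace_fraction(isotropic_delta, basis))
print("aligned delta:  ", relative_subspace_fraction(aligned_delta, basis))
```

Under this definition a delta lying entirely inside the subspace scores d/k, which gives the probe a fixed upper bound for a given k.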

If this is right

  • Alignment deltas inherit directional structure primarily from input activations in read matrices due to their spiked covariance.
  • Cross-entropy pretraining induces prediction geometry specifically in the write pathways.
  • Alignment objectives typically add little additional write-side concentration beyond pretraining.
  • The pattern is supported by within-checkpoint trajectories, graded contrastive controls, and rank-1 interventions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alignment may work by modulating how the model reads existing patterns rather than rewriting its core knowledge.
  • Methods that target read-pathway directions could achieve alignment with smaller overall weight changes.
  • Understanding this read-write asymmetry might help design alignment procedures that better preserve pre-trained capabilities.

Load-bearing premise

The relative-subspace-fraction probe accurately captures the alignment of weight changes with activation subspaces and the unembedding-defined prediction subspace.

What would settle it

A direct measurement showing that alignment weight deltas are equally isotropic or anisotropic in both read and write pathways, or that a closed-form rank-1 update along principal activation directions fails to produce the expected change in model behavior during alignment.
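A rank-1 intervention of this kind can be sketched as follows. This is a toy illustration under assumptions, not the paper's procedure: activations are synthetic Gaussians with one spiked coordinate, the control direction is a random unit vector orthogonal to the principal one (so both updates share the same Frobenius norm), and "behaviour" is reduced to the mean squared change in the layer output. All function names are hypothetical.

```python
import numpy as np

def top_direction(acts):
    """Unit vector along the leading principal direction of acts (n, d)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def matched_control(rng, v):
    """Random unit vector orthogonal to v, giving an update of equal norm."""
    r = rng.standard_normal(v.size)
    r -= (r @ v) * v
    return r / np.linalg.norm(r)

def rank1_update(u, v, alpha):
    """Closed-form rank-1 delta alpha * u v^T; it maps a to alpha * u * (v . a)."""
    return alpha * np.outer(u, v)

def mean_sq_change(delta_w, acts):
    """Mean squared norm of the output change delta_w @ a over all activations."""
    return float(np.mean(np.sum((acts @ delta_w.T) ** 2, axis=1)))

rng = np.random.default_rng(0)
d_in, d_out, n = 32, 16, 2000
scales = np.ones(d_in)
scales[0] = 10.0  # toy spike along one input coordinate
acts = rng.standard_normal((n, d_in)) * scales

u = rng.standard_normal(d_out)
u /= np.linalg.norm(u)
v_top = top_direction(acts)
v_ctl = matched_control(rng, v_top)

effect_top = mean_sq_change(rank1_update(u, v_top, 0.1), acts)
effect_ctl = mean_sq_change(rank1_update(u, v_ctl, 0.1), acts)
print("principal direction:", effect_top)
print("matched control:    ", effect_ctl)
```

Because the update sends a to α u (v·a), its effect scales with the activation variance along v. If real models showed no such gap between principal and control directions, that would count against the proposed geometry.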

Figures

Figures reproduced from arXiv: 2605.16600 by Eli-Shaoul Khedouri, Keiran Thompson, Valeria Ruscio.

Figure 1: Anisotropic gradient accumulation separates transformer read and write pathways.

Original abstract

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $\delta_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $\delta_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that cross-entropy pretraining and preference alignment produce geometrically distinct updates to the same Transformer weights. A relative-subspace-fraction probe is used to demonstrate that alignment deltas concentrate in the read pathways (W_Q, W_K) along principal directions of attention-input activations, while remaining near-isotropic in the write pathways (W_O, W_2) relative to the prediction subspace defined by the unembedding matrix. The asymmetry is explained via anisotropic gradient accumulation, where weight updates are sums of outer products δ_t a_t^⊤; read-path concentration arises from spiked activation covariances (objective-agnostic), while write-path behavior depends on loss-induced anisotropy in δ_t, with alignment adding little further concentration beyond pretraining. Supporting evidence includes within-checkpoint trajectories, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls.

Significance. If the central geometric claims hold, the work supplies a mechanistic account of weight-space asymmetry between pretraining and alignment that could guide more targeted alignment methods and model editing. The within-checkpoint analysis, graded controls, and closed-form interventions constitute reproducible, falsifiable elements that strengthen the causal interpretation beyond purely observational results.

major comments (2)
  1. [Explanation of anisotropic gradient accumulation] Around the outer-product update rule: the claim that alignment losses produce less anisotropic δ_t than cross-entropy (thereby explaining write-pathway isotropy) is load-bearing for the mechanistic account, yet the manuscript reports no direct measurements of the covariance spectrum or effective rank of δ_t on the write side during alignment trajectories. The contrast therefore remains an inference from observed weight deltas rather than a direct test of the proposed source of anisotropy.
  2. [Probe definition] Relative-subspace-fraction probe definition and validation: the probe is used to quantify how strongly deltas concentrate in residual-stream activation subspaces and in the unembedding-defined prediction subspace; however, its robustness to alternative subspace constructions or normalization choices is not fully characterized, which affects the reliability of the reported read/write asymmetry.
minor comments (2)
  1. [Methods] Clarify the precise mathematical definition of the relative-subspace-fraction (including any averaging or normalization over tokens or layers) in the methods.
  2. [Figures] Add error bars or statistical tests to the trajectory and control plots to support the reported differences between pretraining and alignment phases.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the mechanistic claims in our manuscript. We address each major comment below and describe the revisions we will make.

Point-by-point responses
  1. Referee: [Explanation of anisotropic gradient accumulation] Around the outer-product update rule: the claim that alignment losses produce less anisotropic δ_t than cross-entropy (thereby explaining write-pathway isotropy) is load-bearing for the mechanistic account, yet the manuscript reports no direct measurements of the covariance spectrum or effective rank of δ_t on the write side during alignment trajectories. The contrast therefore remains an inference from observed weight deltas rather than a direct test of the proposed source of anisotropy.

    Authors: We agree that direct measurements of the covariance spectrum and effective rank of δ_t during alignment trajectories would strengthen the mechanistic account and move beyond inference from weight deltas alone. Our current evidence relies on within-checkpoint trajectories and graded contrastive controls, which show the resulting write-pathway isotropy but do not isolate the upstream gradient anisotropy. In the revised manuscript we will add explicit computations of the effective rank and leading eigenvalues of δ_t for write-pathway matrices across both pretraining and alignment phases, using the same models and checkpoints as the main experiments. revision: yes

  2. Referee: [Probe definition] Relative-subspace-fraction probe definition and validation: the probe is used to quantify how strongly deltas concentrate in residual-stream activation subspaces and in the unembedding-defined prediction subspace; however, its robustness to alternative subspace constructions or normalization choices is not fully characterized, which affects the reliability of the reported read/write asymmetry.

    Authors: We acknowledge that the robustness of the relative-subspace-fraction probe to alternative subspace constructions and normalization choices has not been fully characterized. To address this, the revision will include a sensitivity analysis that varies the number of principal components retained for the activation subspaces, tests alternative definitions of the prediction subspace (e.g., using the top singular vectors of the unembedding), and compares results under different normalization schemes for the deltas. These checks will be reported alongside the main figures to confirm that the read/write asymmetry is stable under reasonable variations. revision: yes
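The spectrum measurements promised in the first response could be computed along the following lines. This is a sketch under assumptions: the rebuttal does not say which effective-rank definition the authors intend, so the participation ratio is used here as one common choice, and the gradient samples are synthetic Gaussians rather than real δ_t from a model. The function names are hypothetical.

```python
import numpy as np

def covariance_eigenvalues(grads):
    """Eigenvalues (descending) of the sample covariance of grads (n, d)."""
    centered = grads - grads.mean(axis=0, keepdims=True)
    singular = np.linalg.svd(centered, compute_uv=False)
    return singular ** 2 / (grads.shape[0] - 1)

def participation_ratio(eigvals):
    """(sum of eigenvalues)^2 / sum of squared eigenvalues.

    Equals d for a flat spectrum and 1 when a single eigenvalue dominates.
    """
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))

def leading_share(eigvals, k=1):
    """Fraction of total variance carried by the k largest eigenvalues."""
    return float(eigvals[:k].sum() / eigvals.sum())

rng = np.random.default_rng(0)
n, d = 4000, 32

# Stand-ins for upstream gradients: one near-isotropic, one sharply spiked.
flat_grads = rng.standard_normal((n, d))
spiked_scales = np.ones(d)
spiked_scales[0] = 20.0  # hypothetical spike strength
spiked_grads = rng.standard_normal((n, d)) * spiked_scales

for name, g in [("flat", flat_grads), ("spiked", spiked_grads)]:
    ev = covariance_eigenvalues(g)
    print(name, "effective rank:", participation_ratio(ev),
          "top-1 share:", leading_share(ev))
```

Reporting both the effective rank and the leading-eigenvalue share for write-side δ_t in each training phase would test the proposed source of anisotropy directly, rather than inferring it from weight deltas.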

Circularity Check

0 steps flagged

No circularity: derivation rests on external GD outer-product rule plus independent experiments

Full rationale

The paper observes geometric asymmetry in weight deltas between pretraining and alignment, then accounts for it via the standard fact that SGD updates are sums of outer products δ_t a_t^⊤ (a general property of back-propagation, not fitted or redefined inside the paper). Read-pathway concentration follows from the known spiked covariance of residual-stream activations in trained transformers; write-pathway isotropy follows from the claim that alignment losses add little further anisotropy to δ_t. This is tested rather than assumed by construction: the authors supply within-checkpoint trajectories, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched-direction controls. The relative-subspace-fraction probe is a measurement tool whose definitional choices do not force the reported concentration pattern. No self-citations, uniqueness theorems, or fitted parameters are load-bearing; the central claim therefore remains externally grounded and experimentally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard back-propagation update rule and on the assumption that the relative-subspace-fraction probe faithfully measures directional alignment; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • [standard math] Weight updates under gradient descent are sums of outer products between upstream gradients and input activations.
    This identity is invoked to derive the anisotropic accumulation explanation for read versus write pathway geometry.
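
For reference, the identity can be written out for a single linear map. This is a sketch under assumptions: plain gradient descent with a learning rate written here as η (a symbol not used on this page), a sum over the token positions t in a batch, and a fixed unit vector v in input space. Adaptive optimizers rescale the gradient, so for them the update is no longer exactly such a sum.

```latex
\begin{align}
  y_t &= W a_t, \qquad \delta_t = \frac{\partial \mathcal{L}}{\partial y_t}, \\
  \frac{\partial \mathcal{L}}{\partial W} &= \sum_t \delta_t a_t^\top, \qquad
  \Delta W = -\eta \sum_t \delta_t a_t^\top, \\
  \Delta W\, v &= -\eta \sum_t \delta_t \,\bigl(a_t^\top v\bigr).
\end{align}
```

The last line shows that the update's action along an input direction v is weighted by the projections of the activations onto v, and the same reasoning with a left multiplication applies to the gradient side. This is how the update inherits structure from whichever factor has concentrated covariance.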



