Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
Pith reviewed 2026-05-20 20:28 UTC · model grok-4.3
The pith
Pretraining imprints prediction geometry on the write pathways of transformer weights, while alignment concentrates changes in the read pathways.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. Alignment deltas concentrate in the read pathway (W_Q, W_K), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway (W_O, W_2) relative to the prediction subspace defined by the unembedding. The explanation is anisotropic gradient accumulation: updates to a matrix W are sums of outer products δ_t a_t^⊤, inheriting directional structure from the side with concentrated covariance. For read-pathway matrices the input activation a_t has spiked covariance from pretraining, producing objective-agnostic concentration.
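The accumulation argument can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's experiment: the dimensions, the 100x variance spike, the isotropic and independent δ_t, and the helper names `accumulate_update` and `energy_share` are all hypothetical choices made for illustration.

```python
import numpy as np

def accumulate_update(deltas, acts):
    """Sum of outer products delta_t a_t^T; rows of both inputs index t."""
    return deltas.T @ acts  # shape (d_out, d_in)

def energy_share(delta_w, k):
    """Share of squared Frobenius norm in the first k input columns and in the first k output rows."""
    total = np.sum(delta_w ** 2)
    input_share = np.sum(delta_w[:, :k] ** 2) / total
    output_share = np.sum(delta_w[:k, :] ** 2) / total
    return float(input_share), float(output_share)

rng = np.random.default_rng(0)
d_in, d_out, steps, k = 64, 64, 4000, 4

# Assumption: spiked input covariance, with k coordinates carrying 100x the variance.
scales = np.ones(d_in)
scales[:k] = 10.0
acts = rng.standard_normal((steps, d_in)) * scales
# Assumption: isotropic upstream gradients, independent of the activations.
deltas = rng.standard_normal((steps, d_out))

delta_w = accumulate_update(deltas, acts)
input_share, output_share = energy_share(delta_w, k)
print(f"input-side share in spiked directions: {input_share:.3f}")
print(f"output-side share in {k} arbitrary directions: {output_share:.3f} "
      f"(isotropic baseline {k / d_out:.3f})")
```

In this toy setting the expected input-side share is about 400/460, roughly 0.87, because each spiked coordinate contributes 100 units of variance against 1 for each of the other 60, while the output-side share stays near the isotropic baseline k/d_out. The accumulated update inherits its structure from the side with concentrated covariance even though nothing about the objective entered the simulation.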
What carries the argument
The relative-subspace-fraction probe that measures how weight deltas align with residual-stream activation subspaces and the prediction subspace, together with the outer-product structure of gradient updates.
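The page does not give the probe's formula, so the following is only one plausible reading, offered as a sketch: the share of a delta's squared Frobenius norm that falls inside a k-dimensional subspace, divided by the isotropic baseline k/d. The function names `principal_subspace` and `relative_subspace_fraction`, the PCA construction, and the normalisation are assumptions, not the authors' definition.

```python
import numpy as np

def principal_subspace(acts, k):
    """Top-k principal directions (as columns) of activations whose rows are samples."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Right singular vectors of the centred data are the PCA directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T  # shape (d, k), orthonormal columns

def relative_subspace_fraction(delta_w, basis):
    """Input-side energy share of delta_w inside span(basis), relative to the isotropic share k/d.

    delta_w has shape (d_out, d); basis has shape (d, k) with orthonormal columns.
    A value near 1 means isotropic; a value near d/k means fully concentrated.
    """
    d, k = basis.shape
    inside = np.sum((delta_w @ basis) ** 2)
    total = np.sum(delta_w ** 2)
    return float((inside / total) / (k / d))
```

Under this reading, a read-pathway delta would be scored against a basis built from attention-input activations. A write-pathway delta would be scored on its output side by passing `delta_w.T` together with an orthonormal basis for the prediction subspace, however that subspace is constructed.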
If this is right
- Alignment deltas inherit directional structure primarily from input activations in read matrices due to their spiked covariance.
- Cross-entropy pretraining induces prediction geometry specifically in the write pathways.
- Alignment objectives typically add little additional write-side concentration beyond pretraining.
- The pattern is supported by within-checkpoint trajectories, graded contrastive controls, and rank-1 interventions.
Where Pith is reading between the lines
- Alignment may work by modulating how the model reads existing patterns rather than rewriting its core knowledge.
- Methods that target read-pathway directions could achieve alignment with smaller overall weight changes.
- Understanding this read-write asymmetry might help design alignment procedures that better preserve pre-trained capabilities.
Load-bearing premise
The relative-subspace-fraction probe accurately captures the alignment of weight changes with activation subspaces and the unembedding-defined prediction subspace.
What would settle it
A direct measurement showing that alignment weight deltas are equally isotropic or anisotropic in both read and write pathways, or that a closed-form rank-1 update along principal activation directions fails to produce the expected change in model behavior during alignment.
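To make the second test concrete, here is a hedged sketch of what a rank-1 update with a norm-matched control direction could look like on synthetic data. It is not the paper's closed-form intervention, whose details are not given here; the helpers `rank1_update`, `matched_control_direction`, and `mean_output_shift`, the step size, and the toy activations are all hypothetical.

```python
import numpy as np

def rank1_update(u, v, alpha):
    """Rank-1 weight delta alpha * u v^T with unit-normalised u and v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return alpha * np.outer(u, v)

def matched_control_direction(v, rng):
    """Random unit vector orthogonal to v, used as a norm-matched control."""
    v = v / np.linalg.norm(v)
    r = rng.standard_normal(v.shape)
    r -= (r @ v) * v
    return r / np.linalg.norm(r)

def mean_output_shift(delta_w, acts):
    """Mean squared norm of delta_w @ a over the rows a of acts."""
    return float(np.mean(np.sum((acts @ delta_w.T) ** 2, axis=1)))

rng = np.random.default_rng(0)
d = 32
# Assumption: toy activations with a single high-variance coordinate.
scales = np.ones(d)
scales[0] = 10.0
acts = rng.standard_normal((2000, d)) * scales

# Top principal direction of the (uncentred) activations.
_, _, vt = np.linalg.svd(acts, full_matrices=False)
v_top = vt[0]
u = rng.standard_normal(d)

dw_main = rank1_update(u, v_top, alpha=0.1)
dw_ctrl = rank1_update(u, matched_control_direction(v_top, rng), alpha=0.1)
shift_main = mean_output_shift(dw_main, acts)
shift_ctrl = mean_output_shift(dw_ctrl, acts)
print(f"principal direction: {shift_main:.4f}, control: {shift_ctrl:.4f}")
```

Both deltas have the same Frobenius norm, so any difference in output shift comes from direction alone. In a real model the relevant readout would be a behavioural metric rather than this output-norm proxy.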
Original abstract
Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $\delta_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $\delta_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that cross-entropy pretraining and preference alignment produce geometrically distinct updates to the same Transformer weights. A relative-subspace-fraction probe is used to demonstrate that alignment deltas concentrate in the read pathways (W_Q, W_K) along principal directions of attention-input activations, while remaining near-isotropic in the write pathways (W_O, W_2) relative to the prediction subspace defined by the unembedding matrix. The asymmetry is explained via anisotropic gradient accumulation, where weight updates are sums of outer products δ_t a_t^⊤; read-path concentration arises from spiked activation covariances (objective-agnostic), while write-path behavior depends on loss-induced anisotropy in δ_t, with alignment adding little further concentration beyond pretraining. Supporting evidence includes within-checkpoint trajectories, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls.
Significance. If the central geometric claims hold, the work supplies a mechanistic account of weight-space asymmetry between pretraining and alignment that could guide more targeted alignment methods and model editing. The within-checkpoint analysis, graded controls, and closed-form interventions constitute reproducible, falsifiable elements that strengthen the causal interpretation beyond purely observational results.
major comments (2)
- [Explanation of anisotropic gradient accumulation, around the outer-product update rule] The claim that alignment losses produce less anisotropic δ_t than cross-entropy (thereby explaining write-pathway isotropy) is load-bearing for the mechanistic account, yet the manuscript reports no direct measurements of the covariance spectrum or effective rank of δ_t on the write side during alignment trajectories. The contrast therefore remains an inference from observed weight deltas rather than a direct test of the proposed source of anisotropy.
- [Probe definition and validation] The relative-subspace-fraction probe quantifies how strongly deltas concentrate in residual-stream activation subspaces and in the unembedding-defined prediction subspace; however, its robustness to alternative subspace constructions or normalization choices is not fully characterized, which affects the reliability of the reported read/write asymmetry.
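The measurement requested in the first major comment could be sketched as follows. This assumes per-sample upstream gradients δ_t have already been collected as rows of an array, and it uses two common anisotropy summaries (entropy-based effective rank and participation ratio); the manuscript may use a different estimator, and the function names are hypothetical.

```python
import numpy as np

def covariance_spectrum(grads):
    """Eigenvalues (descending) of the uncentred second-moment matrix of the rows of grads."""
    second_moment = grads.T @ grads / grads.shape[0]
    eigvals = np.linalg.eigvalsh(second_moment)[::-1]
    # Clip tiny negative values caused by floating-point error.
    return np.clip(eigvals, 0.0, None)

def effective_rank(eigvals):
    """Exponential of the Shannon entropy of the normalised spectrum."""
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

def participation_ratio(eigvals):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))
```

Both summaries equal the dimension for a flat spectrum and approach 1 when a single direction dominates, so comparing them between cross-entropy and alignment gradients on the write side would test the proposed source of anisotropy directly.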
minor comments (2)
- [Methods] Clarify the precise mathematical definition of the relative-subspace-fraction (including any averaging or normalization over tokens or layers) in the methods.
- [Figures] Add error bars or statistical tests to the trajectory and control plots to support the reported differences between pretraining and alignment phases.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the mechanistic claims in our manuscript. We address each major comment below and describe the revisions we will make.
Point-by-point responses
- Referee: [Explanation of anisotropic gradient accumulation, around the outer-product update rule] The claim that alignment losses produce less anisotropic δ_t than cross-entropy (thereby explaining write-pathway isotropy) is load-bearing for the mechanistic account, yet the manuscript reports no direct measurements of the covariance spectrum or effective rank of δ_t on the write side during alignment trajectories. The contrast therefore remains an inference from observed weight deltas rather than a direct test of the proposed source of anisotropy.
  Authors: We agree that direct measurements of the covariance spectrum and effective rank of δ_t during alignment trajectories would strengthen the mechanistic account and move beyond inference from weight deltas alone. Our current evidence relies on within-checkpoint trajectories and graded contrastive controls, which show the resulting write-pathway isotropy but do not isolate the upstream gradient anisotropy. In the revised manuscript we will add explicit computations of the effective rank and leading eigenvalues of δ_t for write-pathway matrices across both pretraining and alignment phases, using the same models and checkpoints as the main experiments.
  Revision: yes
- Referee: [Probe definition and validation] The relative-subspace-fraction probe quantifies how strongly deltas concentrate in residual-stream activation subspaces and in the unembedding-defined prediction subspace; however, its robustness to alternative subspace constructions or normalization choices is not fully characterized, which affects the reliability of the reported read/write asymmetry.
  Authors: We acknowledge that the robustness of the relative-subspace-fraction probe to alternative subspace constructions and normalization choices has not been fully characterized. To address this, the revision will include a sensitivity analysis that varies the number of principal components retained for the activation subspaces, tests alternative definitions of the prediction subspace (e.g., using the top singular vectors of the unembedding), and compares results under different normalization schemes for the deltas. These checks will be reported alongside the main figures to confirm that the read/write asymmetry is stable under reasonable variations.
  Revision: yes
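One way the promised sensitivity analysis could be organised is a sweep over the number of retained principal components, with centring as a switchable normalisation choice. This is a sketch only: `energy_fraction_sweep`, its arguments, and the ratio-to-k/d normalisation are assumptions, since the probe's exact definition is not stated here.

```python
import numpy as np

def energy_fraction_sweep(delta_w, acts, ks, center=True):
    """For each k, return (energy share in the top-k activation PCA subspace, share divided by k/d).

    delta_w has shape (d_out, d); acts has shape (n_samples, d).
    """
    data = acts - acts.mean(axis=0, keepdims=True) if center else acts
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    # Energy of delta_w along each principal direction, normalised by total energy.
    per_direction = np.sum((delta_w @ vt.T) ** 2, axis=0) / np.sum(delta_w ** 2)
    d = delta_w.shape[1]
    results = {}
    for k in ks:
        share = float(per_direction[:k].sum())
        results[k] = (share, share / (k / d))
    return results
```

Plotting the second entry against k, with and without centring, would show whether the read/write contrast depends on a particular cutoff. The same sweep applied to `delta_w.T` with rows of the unembedding in place of `acts` would cover the alternative prediction-subspace construction mentioned in the response.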
Circularity Check
No circularity: derivation rests on external GD outer-product rule plus independent experiments
full rationale
The paper observes geometric asymmetry in weight deltas between pretraining and alignment, then accounts for it via the standard fact that SGD updates are sums of outer products δ_t a_t^⊤ (a general property of back-propagation, not fitted or redefined inside the paper). Read-pathway concentration follows from the known spiked covariance of residual-stream activations in trained transformers; write-pathway isotropy follows from the claim that alignment losses add little further anisotropy to δ_t. This is tested rather than assumed by construction: the authors supply within-checkpoint trajectories, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched-direction controls. The relative-subspace-fraction probe is a measurement tool whose definitional choices do not force the reported concentration pattern. No self-citations, uniqueness theorems, or fitted parameters are load-bearing; the central claim therefore remains externally grounded and experimentally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Weight updates under gradient descent are sums of outer products between upstream gradients and input activations.
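For reference, the axiom can be written out for a single linear map under plain gradient descent. The learning rate η, the per-sample loss ℓ_t, and the orthogonal projectors P (input side) and Q (output side) are notation introduced here for illustration; adaptive optimizers rescale the update and are not covered by this sketch.

```latex
\begin{align}
y_t &= W a_t, \qquad \delta_t = \frac{\partial \ell_t}{\partial y_t}, \\
\frac{\partial \ell_t}{\partial W} &= \delta_t a_t^\top, \\
\Delta W &= -\eta \sum_t \delta_t a_t^\top, \\
\Delta W\, P &= -\eta \sum_t \delta_t \,(P a_t)^\top, \qquad
Q\, \Delta W = -\eta \sum_t (Q \delta_t)\, a_t^\top .
\end{align}
```

The last line shows why each side is probed separately: projecting on the input side only sees the components of a_t inside the subspace, and projecting on the output side only sees the components of δ_t inside it.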
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "updates to a matrix W are sums of outer products δ_t a_t^⊤, and inherit directional structure from whichever side has concentrated covariance"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Cross-entropy pretraining is the canonical sharp-gradient regime... simplex-vertex attractor"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.