Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

Eli-Shaoul Khedouri; Keiran Thompson; Valeria Ruscio

arXiv: 2605.16600 · v1 · pith:CZWASWXRnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 20:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords transformer weight space · pretraining alignment asymmetry · gradient accumulation · read write pathways · residual stream · attention mechanisms · prediction subspace · outer product updates

The pith

Pretraining imprints prediction geometry on the write pathways of transformer weights, while alignment concentrates changes in the read pathways.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that cross-entropy pretraining and preference alignment update the same transformer weights but produce geometrically distinct patterns. Alignment weight changes focus on the read components of attention, aligning with the main directions of input activations, whereas the write components stay spread out relative to the model's prediction directions. This pattern arises because updates are formed as outer products of gradients and activations, inheriting anisotropy from whichever side carries more structure after pretraining. A sympathetic reader would care because this geometry could clarify why alignment requires far less data than pretraining and why it tends to preserve rather than overwrite core capabilities.

Core claim

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. Alignment deltas concentrate in the read pathway (W_Q, W_K), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway (W_O, W_2) relative to the prediction subspace defined by the unembedding. The explanation is anisotropic gradient accumulation: updates to a matrix W are sums of outer products δ_t a_t^⊤, inheriting directional structure from the side with concentrated covariance. For read-pathway matrices the input activation a_t has spiked covariance from pretraining, producing objective-agnostic concentration.
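
A toy illustration of the accumulation argument may help here. The sketch below is not the paper's experiment: it assumes synthetic Gaussian data, isotropic upstream gradients, and activations with a single high-variance direction. The names accumulate_update and direction_energy are hypothetical helpers introduced only for this example.

```python
import numpy as np

def accumulate_update(deltas, acts):
    """Sum of outer products: delta_W = sum_t delta_t a_t^T, shape (d_out, d_in)."""
    return deltas.T @ acts

def direction_energy(delta_w, direction, side):
    """Fraction of ||delta_w||_F^2 along one unit direction on the input or output side."""
    direction = direction / np.linalg.norm(direction)
    proj = delta_w @ direction if side == "in" else direction @ delta_w
    return float(np.sum(proj**2) / np.sum(delta_w**2))

rng = np.random.default_rng(0)
d_in, d_out, T = 64, 64, 4096

# Assumption: activations are isotropic noise plus one spiked direction (coordinate 0).
spike = np.zeros(d_in)
spike[0] = 1.0
acts = rng.normal(size=(T, d_in)) + 8.0 * rng.normal(size=(T, 1)) * spike

# Assumption: upstream gradients are isotropic, i.e. the loss adds no structure.
deltas = rng.normal(size=(T, d_out))

delta_w = accumulate_update(deltas, acts)
in_frac = direction_energy(delta_w, spike, "in")              # input-side concentration
out_frac = direction_energy(delta_w, np.eye(d_out)[0], "out")  # output-side reference
print(f"input side: {in_frac:.3f}, output side: {out_frac:.3f}, isotropic baseline: {1 / d_in:.3f}")
```

In this toy setting the accumulated update concentrates on the input side along the spiked activation direction, while the output side stays near the isotropic baseline, even though the gradients carry no preferred direction.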

What carries the argument

The relative-subspace-fraction probe that measures how weight deltas align with residual-stream activation subspaces and the prediction subspace, together with the outer-product structure of gradient updates.
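
The page does not give the probe's exact formula, so the following is only one plausible reading, sketched under stated assumptions: the fraction of a weight delta's squared Frobenius norm that falls inside a k-dimensional subspace, divided by the isotropic baseline k/d. The function names, the PCA-based subspace construction, and the normalisation are all assumptions made for illustration.

```python
import numpy as np

def principal_subspace(acts, k):
    """Top-k principal directions of activations (T, d), as orthonormal columns (d, k)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def relative_subspace_fraction(delta_w, basis, side="in"):
    """Energy fraction of delta_w inside span(basis), divided by the baseline k/d.

    Assumed convention: 1.0 means isotropic, d/k means fully confined to the subspace.
    """
    d, k = basis.shape
    proj = delta_w @ basis if side == "in" else basis.T @ delta_w
    frac = np.sum(proj**2) / np.sum(delta_w**2)
    return float(frac / (k / d))

rng = np.random.default_rng(0)
d, k = 32, 4
scales = np.ones(d)
scales[:k] = 10.0                          # synthetic spiked activation covariance
acts = rng.normal(size=(500, d)) * scales
basis = principal_subspace(acts, k)

confined = rng.normal(size=(16, d)) @ basis @ basis.T  # delta living inside the subspace
isotropic = rng.normal(size=(256, d))                  # delta with no preferred direction

rsf_confined = relative_subspace_fraction(confined, basis)    # d/k = 8 up to rounding
rsf_isotropic = relative_subspace_fraction(isotropic, basis)  # close to 1
```

Dividing by k/d makes values comparable across matrices of different widths and subspace sizes, which is one reason a relative rather than raw fraction might be preferred.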

If this is right

Alignment deltas inherit directional structure primarily from input activations in read matrices due to their spiked covariance.
Cross-entropy pretraining induces prediction geometry specifically in the write pathways.
Alignment objectives typically add little additional write-side concentration beyond pretraining.
The pattern is supported by within-checkpoint trajectories, graded contrastive controls, and rank-1 interventions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment may work by modulating how the model reads existing patterns rather than rewriting its core knowledge.
Methods that target read-pathway directions could achieve alignment with smaller overall weight changes.
Understanding this read-write asymmetry might help design alignment procedures that better preserve pre-trained capabilities.

Load-bearing premise

The relative-subspace-fraction probe accurately captures the alignment of weight changes with activation subspaces and the unembedding-defined prediction subspace.

What would settle it

A direct measurement showing that alignment weight deltas are equally isotropic or anisotropic in both read and write pathways, or that a closed-form rank-1 update along principal activation directions fails to produce the expected change in model behavior during alignment.
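
To make the second test concrete, here is a minimal sketch of a rank-1 update with a matched-norm control direction. It is a toy on synthetic activations, not the paper's intervention: the functions rank1_update and mean_output_shift, the step size 0.1, and the choice of control direction are hypothetical.

```python
import numpy as np

def rank1_update(u, v, alpha):
    """Closed-form rank-1 delta alpha * u v^T, with u and v normalised to unit length."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return alpha * np.outer(u, v)

def mean_output_shift(delta_w, acts):
    """Mean over samples of ||delta_w a_t||^2, the change in the layer output."""
    return float(np.mean(np.sum((acts @ delta_w.T) ** 2, axis=1)))

rng = np.random.default_rng(1)
d, T = 32, 2000
principal = np.eye(d)[0]
# Assumption: activations have one dominant direction (coordinate 0).
acts = rng.normal(size=(T, d)) + 6.0 * rng.normal(size=(T, 1)) * principal

u = rng.normal(size=d)     # arbitrary write direction shared by both updates
control = np.eye(d)[1]     # control read direction, orthogonal to the principal one

shift_principal = mean_output_shift(rank1_update(u, principal, 0.1), acts)
shift_control = mean_output_shift(rank1_update(u, control, 0.1), acts)
print(f"principal: {shift_principal:.4f}, control: {shift_control:.4f}")
```

Both updates have the same Frobenius norm, so any difference in output shift comes from how much activation variance lies along the chosen read direction. A real test would compare behavioural changes in the model rather than this layer-level proxy.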

Figures

Figures reproduced from arXiv: 2605.16600 by Eli-Shaoul Khedouri, Keiran Thompson, Valeria Ruscio.

Original abstract

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $\delta_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $\delta_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pretraining and alignment leave geometrically distinct traces in transformer weights, with alignment concentrating changes in read pathways.

The main takeaway is that pretraining and alignment update the same transformer weights but produce different geometric patterns. Alignment deltas line up with principal directions of the attention-input activations in the read matrices, while the write matrices stay closer to isotropic relative to the prediction subspace defined by the unembedding matrix. This asymmetry is the core observation the paper puts forward.

What is new is the relative-subspace-fraction probe, which quantifies how weight deltas overlap with residual-stream activation subspaces and with the prediction subspace. The authors support the pattern with a within-checkpoint trajectory analysis, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls. These elements give the claim some causal grounding beyond simple observation. The account ties directly to the outer-product form of gradient updates without introducing fitted parameters, which keeps the explanation grounded in standard gradient-descent mechanics. For read-pathway matrices the input activations already carry spiked covariance in trained models, so the concentration appears objective-agnostic. For write-pathway matrices the structure depends on the upstream gradient, and the paper argues that alignment losses add less anisotropy than cross-entropy pretraining.

One soft spot is the missing direct check on gradient anisotropy during alignment. The paper does not report covariance spectra or effective rank of the upstream gradient vectors δ_t on the write side under alignment objectives, so the isotropy claim for writes is inferred from the observed weight deltas rather than tested head-on. The read-side concentration and the overall asymmetry still stand as reported.

Readers working on mechanistic interpretability or alignment design would get value from the geometric handle and the probe. The work engages closely with the update rule and supplies interventions designed to be reproducible, so it deserves a serious referee even if the gradient-measurement gap needs addressing in revision. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that cross-entropy pretraining and preference alignment produce geometrically distinct updates to the same Transformer weights. A relative-subspace-fraction probe is used to demonstrate that alignment deltas concentrate in the read pathways (W_Q, W_K) along principal directions of attention-input activations, while remaining near-isotropic in the write pathways (W_O, W_2) relative to the prediction subspace defined by the unembedding matrix. The asymmetry is explained via anisotropic gradient accumulation, where weight updates are sums of outer products δ_t a_t^⊤; read-path concentration arises from spiked activation covariances (objective-agnostic), while write-path behavior depends on loss-induced anisotropy in δ_t, with alignment adding little further concentration beyond pretraining. Supporting evidence includes within-checkpoint trajectories, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls.

Significance. If the central geometric claims hold, the work supplies a mechanistic account of weight-space asymmetry between pretraining and alignment that could guide more targeted alignment methods and model editing. The within-checkpoint analysis, graded controls, and closed-form interventions constitute reproducible, falsifiable elements that strengthen the causal interpretation beyond purely observational results.

major comments (2)

[Explanation of anisotropic gradient accumulation, around the outer-product update rule] The claim that alignment losses produce less anisotropic δ_t than cross-entropy (thereby explaining write-pathway isotropy) is load-bearing for the mechanistic account, yet the manuscript reports no direct measurements of the covariance spectrum or effective rank of δ_t on the write side during alignment trajectories. The contrast therefore remains an inference from observed weight deltas rather than a direct test of the proposed source of anisotropy.
[Probe definition] Relative-subspace-fraction probe definition and validation: The probe is used to quantify concentration of deltas with residual-stream activation subspaces and the unembedding-defined prediction subspace; however, its robustness to alternative subspace constructions or normalization choices is not fully characterized, which affects the reliability of the reported read/write asymmetry.
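
The measurement requested in the first major comment could look roughly like the sketch below. It assumes the upstream gradients δ_t have already been collected as rows of an array, and it uses the participation ratio as the effective-rank measure. The report does not say which definition of effective rank is intended, so that choice is an assumption, as are the function names.

```python
import numpy as np

def covariance_spectrum(vectors):
    """Eigenvalues (descending) of the sample covariance of row vectors, shape (T, d)."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    return s**2 / (len(vectors) - 1)

def participation_ratio(eigvals):
    """(sum of eigenvalues)^2 / sum of squares: d for a flat spectrum, 1 for one spike."""
    return float(eigvals.sum() ** 2 / np.sum(eigvals**2))

rng = np.random.default_rng(0)
iso = rng.normal(size=(5000, 16))   # stand-in for isotropic gradients
spiked = iso.copy()
spiked[:, 0] *= 20.0                # stand-in for sharply anisotropic gradients

pr_iso = participation_ratio(covariance_spectrum(iso))
pr_spiked = participation_ratio(covariance_spectrum(spiked))
print(f"isotropic: {pr_iso:.2f}, spiked: {pr_spiked:.2f}")
```

Applied to real δ_t from pretraining and alignment checkpoints, a clearly lower participation ratio under cross-entropy than under the alignment loss would be the direct evidence the comment asks for.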

minor comments (2)

[Methods] Clarify the precise mathematical definition of the relative-subspace-fraction (including any averaging or normalization over tokens or layers) in the methods.
[Figures] Add error bars or statistical tests to the trajectory and control plots to support the reported differences between pretraining and alignment phases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the mechanistic claims in our manuscript. We address each major comment below and describe the revisions we will make.

Referee: [Explanation of anisotropic gradient accumulation, around the outer-product update rule] The claim that alignment losses produce less anisotropic δ_t than cross-entropy (thereby explaining write-pathway isotropy) is load-bearing for the mechanistic account, yet the manuscript reports no direct measurements of the covariance spectrum or effective rank of δ_t on the write side during alignment trajectories. The contrast therefore remains an inference from observed weight deltas rather than a direct test of the proposed source of anisotropy.

Authors: We agree that direct measurements of the covariance spectrum and effective rank of δ_t during alignment trajectories would strengthen the mechanistic account and move beyond inference from weight deltas alone. Our current evidence relies on within-checkpoint trajectories and graded contrastive controls, which show the resulting write-pathway isotropy but do not isolate the upstream gradient anisotropy. In the revised manuscript we will add explicit computations of the effective rank and leading eigenvalues of δ_t for write-pathway matrices across both pretraining and alignment phases, using the same models and checkpoints as the main experiments.
revision: yes

Referee: [Probe definition] Relative-subspace-fraction probe definition and validation: The probe is used to quantify concentration of deltas with residual-stream activation subspaces and the unembedding-defined prediction subspace; however, its robustness to alternative subspace constructions or normalization choices is not fully characterized, which affects the reliability of the reported read/write asymmetry.

Authors: We acknowledge that the robustness of the relative-subspace-fraction probe to alternative subspace constructions and normalization choices has not been fully characterized. To address this, the revision will include a sensitivity analysis that varies the number of principal components retained for the activation subspaces, tests alternative definitions of the prediction subspace (e.g., using the top singular vectors of the unembedding), and compares results under different normalization schemes for the deltas. These checks will be reported alongside the main figures to confirm that the read/write asymmetry is stable under reasonable variations.
revision: yes

Circularity Check

0 steps flagged

No circularity: derivation rests on external GD outer-product rule plus independent experiments

The paper observes geometric asymmetry in weight deltas between pretraining and alignment, then accounts for it via the standard fact that SGD updates are sums of outer products δ_t a_t^⊤ (a general property of back-propagation, not fitted or redefined inside the paper). Read-pathway concentration follows from the known spiked covariance of residual-stream activations in trained transformers; write-pathway isotropy follows from the claim that alignment losses add little further anisotropy to δ_t. This is tested rather than assumed by construction: the authors supply within-checkpoint trajectories, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched-direction controls. The relative-subspace-fraction probe is a measurement tool whose definitional choices do not force the reported concentration pattern. No self-citations, uniqueness theorems, or fitted parameters are load-bearing; the central claim therefore remains externally grounded and experimentally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard back-propagation update rule and on the assumption that the relative-subspace-fraction probe faithfully measures directional alignment; no free parameters or new entities are introduced in the abstract.

axioms (1)

[standard math] Weight updates under gradient descent are sums of outer products between upstream gradients and input activations.
This identity is invoked to derive the anisotropic accumulation explanation for read versus write pathway geometry.
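
For reference, the identity can be written out for a single linear map. The derivation below assumes y_t = W a_t with column vectors, a per-sample loss ℓ_t, and plain SGD with step size η. The final line adds a simplifying assumption that is not taken from the paper: conditional on the activations, δ_t is zero-mean and independent across samples, with E[δ_t^⊤ δ_t] = σ² d_out.

```latex
% Assumed setting: y_t = W a_t, per-sample loss \ell_t, plain SGD with step size \eta.
\begin{align}
  \delta_t &= \frac{\partial \ell_t}{\partial y_t}, \qquad y_t = W a_t, \\
  \frac{\partial \ell_t}{\partial W} &= \delta_t a_t^\top, \\
  \Delta W &= -\eta \sum_t \delta_t a_t^\top .
\end{align}
% Illustrative assumption (not from the paper): conditional on the a_t, the \delta_t are
% zero-mean and independent across t, with E[\delta_t^\top \delta_t] = \sigma^2 d_{out}.
\begin{equation}
  \mathbb{E}\!\left[\Delta W^\top \Delta W\right]
    = \eta^2 \sigma^2 d_{\mathrm{out}} \sum_t a_t a_t^\top .
\end{equation}
```

Under that assumption the input-side Gram matrix of the update is proportional to the uncentred activation covariance, which is one way to see why a spiked activation spectrum would produce concentration regardless of the objective. Adaptive optimisers rescale the update, so the proportionality should be read as a heuristic there.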

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "updates to a matrix W are sums of outer products δ_t a_t^⊤, and inherit directional structure from whichever side has concentrated covariance"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Cross-entropy pretraining is the canonical sharp-gradient regime... simplex-vertex attractor"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Editing models with task arithmetic , author=. The Eleventh International Conference on Learning Representations , year=

work page

[18] [18]

2023 , url=

Prateek Yadav and Derek Tam and Leshem Choshen and Colin Raffel and Mohit Bansal , booktitle=. 2023 , url=

work page 2023

[19] [19]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

What are you sinking? A geometric approach on attention sink , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page

[20] [20]

2023 , url=

Attention-likelihood relationship in Transformers , author=. 2023 , url=

work page 2023

[21] [21]

International conference on machine learning , pages=

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time , author=. International conference on machine learning , pages=. 2022 , organization=

work page 2022

[22] [22]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

work page

[23] [23]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Orpo: Monolithic preference optimization without reference model , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024

[24] [24]

arXiv preprint arXiv:2603.04948 , year=

nabla-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space , author=. arXiv preprint arXiv:2603.04948 , year=

work page arXiv

[25] [25]

International conference on machine learning , pages=

Pythia: A suite for analyzing large language models across training and scaling , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023

[26] [26]

Sentence-bert: Sentence embeddings using siamese bert-networks , author=. Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) , pages=

work page 2019

[27] [27]

Natural Gradient Works Efficiently in Learning , year=

Amari, Shun-ichi , journal=. Natural Gradient Works Efficiently in Learning , year=

work page

[28] [28]

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics , pages =

Fisher Information and Natural Gradient Learning in Random Deep Networks , author =. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics , pages =. 2019 , editor =

work page 2019

[29] [29]

Soudry, Daniel and Hoffer, Elad and Nacson, Mor Shpigel and Gunasekar, Suriya and Srebro, Nathan , title =. J. Mach. Learn. Res. , month = jan, pages =. 2018 , issue_date =

work page 2018

[30] [30]

Vardan Papyan and X. Y. Han and David L. Donoho , title =. CoRR , volume =. 2020 , url =. 2008.08186 , timestamp =

work page arXiv 2020

[31] [31]

Cohen , booktitle=

Zhilin Yang and Zihang Dai and Ruslan Salakhutdinov and William W. Cohen , booktitle=. Breaking the Softmax Bottleneck: A High-Rank. 2018 , url=

work page 2018

[32] [32]

2024 , eprint=

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study , author=. 2024 , eprint=

work page 2024

[33] [33]

2024 , eprint=

ReFT: Representation Finetuning for Language Models , author=. 2024 , eprint=

work page 2024

[34] [34]

2025 , eprint=

Robust LLM safeguarding via refusal feature adversarial training , author=. 2025 , eprint=

work page 2025

[35] [35]

2025 , eprint=

The Geometry of Categorical and Hierarchical Concepts in Large Language Models , author=. 2025 , eprint=

work page 2025

[36] [36]

International Conference on Learning Representations , year=

Pointer Sentinel Mixture Models , author=. International Conference on Learning Representations , year=

work page

[37] [37]

Liu , title =

Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =

work page

[38] [38]

OpenAssistant Conversations - Democratizing Large Language Model Alignment , url =

K\". OpenAssistant Conversations - Democratizing Large Language Model Alignment , url =. Advances in Neural Information Processing Systems , editor =

work page

[39] [39]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023

[40] [40]

Linguistic Collapse: Neural Collapse in (Large) Language Models , url =

Wu, Robert and Papyan, Vardan , booktitle =. Linguistic Collapse: Neural Collapse in (Large) Language Models , url =. doi:10.52202/079017-4366 , editor =

work page doi:10.52202/079017-4366

[41] [41]

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Breaking the softmax bottleneck: A high-rank RNN language model , author=. arXiv preprint arXiv:1711.03953 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

Distill , year =

Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =

work page

[43] [43]

In-context Learning and Induction Heads

In-context learning and induction heads , author=. arXiv preprint arXiv:2209.11895 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

Interpretability in the Wild: a Circuit for Indirect Object Identification in

Kevin Ro Wang and Alexandre Variengien and Arthur Conmy and Buck Shlegeris and Jacob Steinhardt , booktitle=. Interpretability in the Wild: a Circuit for Indirect Object Identification in. 2023 , url=

work page 2023

[45] [45]

2023 , journal=

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author=. 2023 , journal=

work page 2023

[46] [46]

Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

Ravfogel, Shauli and Elazar, Yanai and Gonen, Hila and Twiton, Michael and Goldberg, Yoav. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.647

work page doi:10.18653/v1/2020.acl-main.647 2020

[47] [47]

Steering Language Models With Activation Engineering

Steering language models with activation engineering , author=. arXiv preprint arXiv:2308.10248 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[48] [48]

Steering Llama 2 via Contrastive Activation Addition

Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

work page doi:10.18653/v1/2024.acl-long.828 2024

[49] [49]

arXiv preprint arXiv:2110.11309 , year=

Fast model editing at scale , author=. arXiv preprint arXiv:2110.11309 , year=

work page arXiv

[50] [50]

International Conference on Learning Representations , year=

Editable Neural Networks , author=. International Conference on Learning Representations , year=

work page

[51] [51]

, author=

Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=

work page

[52] [52]

The Eleventh International Conference on Learning Representations , year=

Progress measures for grokking via mechanistic interpretability , author=. The Eleventh International Conference on Learning Representations , year=

work page

[53] [53]

Language Models Implement Simple W ord2 V ec-style Vector Arithmetic

Merullo, Jack and Eickhoff, Carsten and Pavlick, Ellie. Language Models Implement Simple W ord2 V ec-style Vector Arithmetic. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.281

work page doi:10.18653/v1/2024.naacl-long.281 2024

[54] [54]

arXiv preprint arXiv:1905.12213 , year=

Where is the information in a deep neural network? , author=. arXiv preprint arXiv:1905.12213 , year=

work page arXiv 1905

[55] [55]

International conference on machine learning , pages=

Optimizing neural networks with kronecker-factored approximate curvature , author=. International conference on machine learning , pages=. 2015 , organization=

work page 2015

[56] [56]

Advances in Neural Information Processing Systems , volume=

Linguistic collapse: Neural collapse in (large) language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[57] [57]

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Large batch optimization for deep learning: Training bert in 76 minutes , author=. arXiv preprint arXiv:1904.00962 , year=

work page internal anchor Pith review arXiv 1904

[58] [58]

The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter , url =

JAISWAL, AJAY and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang "Atlas" , booktitle =. The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter , url =

work page

[59] [59]

Forty-first International Conference on Machine Learning , year=

Language models are super mario: Absorbing abilities from homologous models as a free lunch , author=. Forty-first International Conference on Machine Learning , year=

work page