Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

Jordan F. McCann

arxiv: 2605.24577 · v1 · pith:JHVAXKMBnew · submitted 2026-05-23 · 💻 cs.LG · cs.AI· cs.CL

Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

Jordan F. McCann This is my paper

Pith reviewed 2026-06-30 14:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords polymorphismrotationresidual streamsparse autoencoderssteering vectorsProcrustes alignmenttransformer interpretabilitymodel universality

0 comments

The pith

Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on SO(d_model).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that independently trained transformers perform identical computations yet store them in residual-stream coordinates that differ by a random rotation drawn from the Haar measure on the special orthogonal group. Standard sparse-autoencoder universality checks report high decoder similarity across seeds while missing that the encoder operates in the rotated frame, yielding negative explained variance on cross-seed reconstruction. A single orthogonal Procrustes matrix fitted to one batch of activations removes the rotation and restores reconstruction to within 0.025 explained variance of the within-seed ceiling at every site. The same alignment transfers diff-of-means steering vectors when their directions overlap the rotation's invariant subspace and inverts them otherwise. The account is validated from a 104k-parameter Dyck-3 model through nine Pythia-70m seeds.

Core claim

Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on SO(d_model). One matrix multiplication per model pair removes it: an orthogonal Procrustes fit on a single batch of activations transfers sparse-autoencoder feature dictionaries and steering vectors between independently trained models, with no retraining. The phenomenon is invisible to the standard SAE universality metric. Decoder-column cosine similarity matches across seeds at 98 percent while an SAE trained on one seed reconstructs another seed's activations at negative explained variance. R is Haar-distributed and the same rotation account holds across traini

What carries the argument

Polymorphism as a uniform random rotation on SO(d_model), identified and removed by an orthogonal Procrustes fit on residual-stream activations.

If this is right

Decoder columns of SAEs align across seeds while encoders require the rotation correction to achieve positive explained variance.
Steering vectors transfer cleanly when pinned by shared output weights, partially when overlapping the rotated subspace, and invert when they lie outside it.
With no shared input or output weights, all steering-vector transfers collapse to universal inversion.
The identical rotation accounts for activation differences observed across successive checkpoints of a single training run.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A cheap pre-alignment step could let feature dictionaries and circuits be reused across any number of independently trained instances of the same architecture.
Many reported failures of feature universality may trace to basis mismatch rather than genuine computational divergence.
If the Haar property persists at larger scales, interpretability pipelines could treat rotation alignment as a standard preprocessing step before circuit analysis.

Load-bearing premise

The models compute exactly the same function and differ only by a change of basis in the residual stream.

What would settle it

A new pair of independently trained models on which the fitted Procrustes rotation leaves cross-seed reconstruction at negative explained variance or yields a matrix whose Frobenius distance from the identity deviates from the Haar prediction by more than 0.1 percent would falsify the rotation account.

Figures

Figures reproduced from arXiv: 2605.24577 by Jordan F. McCann.

**Figure 2.** Figure 2: Cross-seed SAE reconstruction at both scales. Black bars: within-seed self-reconstruction [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: The cross-seed rotation R is a uniform random orthogonal matrix. Left: histogram of observed ∥R − I∥F √ across (pair, site) combinations. Main panel: Dyck-3 toy (d = 64), prediction 2 · 64 = 11.31; inset: Pythia-70m (d = 512), prediction √ 2 · 512 = 32.00. Right: eigenvalue-angle histogram of R pooled across all Pythia (pair, site) combinations (green bars; 28,672 eigenvalues) overlaid on the Haar SO(512) … view at source ↗

**Figure 4.** Figure 4: Steering vector transfer. Left: Dyck-3 (Cohort A, shared frozen I/O). Three regimes [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

read the original abstract

Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on $\mathrm{SO}(d_{\mathrm{model}})$. We call this phenomenon polymorphism: same function, mutually unintelligible interior coordinates. One matrix multiplication per model pair removes it: an orthogonal Procrustes fit on a single batch of activations transfers sparse-autoencoder feature dictionaries and steering vectors between independently trained models, with no retraining. The phenomenon is invisible to the standard SAE universality metric. Decoder-column cosine similarity matches across seeds at 98%, the SAE-universality headline number, while an SAE trained on one seed reconstructs another seed's activations at negative explained variance, worse than predicting the constant mean. The decoder columns align; the encoder reads from a rotated frame. A single Procrustes rotation $R$ restores reconstruction to within 0.025 EV of the within-seed ceiling at every internal site. $R$ is Haar-distributed: $\|R - I\|_F$ matches the random-orthogonal prediction $\sqrt{2 d_{\mathrm{model}}}$ to 0.1% at $d_{\mathrm{model}} = 512$, and a Kolmogorov-Smirnov test of $R$'s eigenvalue spectrum against Haar $\mathrm{SO}(d_{\mathrm{model}})$ returns $p \approx 1.000$ pooled and per-pair. Diff-of-means steering vectors transfer in three regimes by alignment with $R$'s invariant subspace: clean when pinned by shared output weights, partial when overlapping the rotated subspace, inverted otherwise. With no shared I/O (Pythia), all three collapse to universally inverted. The same rotation account holds across training checkpoints within a single run. Validated on a 104k-parameter Dyck-3 transformer and nine independently-trained Pythia-70m seeds on The Pile, via a pre-registered four-bar operational framework. Frontier-scale (10B+) replication remains open.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper pins SAE non-transfer on a Haar rotation and shows Procrustes fixes it, but the same-function premise lacks direct output checks.

read the letter

The main thing here is that independently trained transformers differ by a uniform random rotation in their residual streams, and one orthogonal Procrustes fit on activations transfers both SAEs and steering vectors without retraining.

What stands out as new is the specific identification of the misalignment as Haar-distributed on SO(d_model). The Frobenius norm matches the random-orthogonal prediction to 0.1 percent at d_model=512, and the KS test on the eigenvalue spectrum gives p near 1. The steering-vector results across three regimes (clean, partial, inverted) based on overlap with the rotated subspace add a useful extension. The same pattern appears on the small Dyck-3 model, on nine Pythia-70m seeds, and even across checkpoints in one run.

The work is solid on the numbers. Standard cosine similarity hits 98 percent while cross-seed reconstruction EV goes negative; the Procrustes step recovers to within 0.025 EV of the within-seed ceiling at every site. The pre-registered four-bar framework and the honest note that frontier-scale replication is still open are both positives.

The soft spot is the load-bearing assumption that the models compute exactly the same function. The abstract reports no direct comparison of model outputs or losses on the same inputs, so the fitted R could be absorbing small functional differences rather than correcting only a basis change. The stress-test concern lands.

This is for mechanistic-interpretability people already working on SAE universality and feature reuse. A reader in that niche would get a concrete, cheap method worth trying.

It deserves peer review. The empirical claims are precise and the hypothesis is testable with the data they collected.

Referee Report

3 major / 2 minor

Summary. The paper claims that independently trained transformers compute the same function in residual-stream bases differing by a uniform random rotation on SO(d_model), a phenomenon termed polymorphism. An orthogonal Procrustes fit R on a single batch of activations removes this difference, enabling transfer of SAE feature dictionaries and steering vectors between models without retraining. Evidence includes restoration of reconstruction to within 0.025 EV of the within-seed ceiling, ||R - I||_F matching the Haar prediction to 0.1% at d_model=512, and a KS test of R's eigenvalue spectrum against Haar SO(d_model) yielding p≈1. The account is validated on a 104k-parameter Dyck-3 transformer and nine Pythia-70m seeds via a pre-registered four-bar operational framework, with consistent behavior across checkpoints.

Significance. If the result holds, it is significant for mechanistic interpretability: it supplies an operational, falsifiable account of apparent non-universality in SAEs and a practical method to align representations across seeds using one matrix multiplication. The pre-registered framework, quantitative match to Haar measure, and extension from a toy Dyck-3 model to Pythia-70m are explicit strengths that make the claim testable and reproducible. The work could enable reuse of interpretability artifacts at scale if the functional-equivalence premise is confirmed.

major comments (3)

[Abstract] Abstract and validation section: the load-bearing premise that the models compute exactly the same function (differing only by basis change) is not directly verified. No comparison of cross-entropy loss or output distributions on identical held-out inputs is reported; the Procrustes fit on activations could improve alignment even if small functional differences exist, and the Haar/KS results are consistent with either pure rotation or partial absorption of non-rotational variance.
[Steering vector transfer] Steering-vector transfer paragraph: the three-regime claim (clean/partial/inverted depending on overlap with R's invariant subspace) and the collapse to universal inversion under no shared I/O rest on the same unverified functional-equivalence assumption; without an explicit test that the models agree on the underlying computation, the regime distinctions could reflect compensation rather than rotation of an identical function.
[Pythia-70m results] Pythia-70m results paragraph: the reported 0.025 EV gap and p≈1 are impressive, but because the central claim equates activation alignment with functional identity, the manuscript must show that the fitted R does not reduce reconstruction error by fitting to non-rotational discrepancies; otherwise the polymorphism interpretation is under-determined by the data.

minor comments (2)

[Abstract] The abbreviation EV should be defined at first use for readers outside the immediate subfield.
[Figures] Figure captions should explicitly state the number of model pairs and seeds used in each panel to make the quantitative claims immediately verifiable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and for emphasizing the importance of verifying functional equivalence. We respond to each major comment below, drawing on the pre-registered framework, the exact equivalence in the Dyck-3 model, and the quantitative match to Haar measure. We propose a targeted revision where it strengthens the manuscript without altering the core claims.

read point-by-point responses

Referee: [Abstract] Abstract and validation section: the load-bearing premise that the models compute exactly the same function (differing only by basis change) is not directly verified. No comparison of cross-entropy loss or output distributions on identical held-out inputs is reported; the Procrustes fit on activations could improve alignment even if small functional differences exist, and the Haar/KS results are consistent with either pure rotation or partial absorption of non-rotational variance.

Authors: In the Dyck-3 transformer the models are trained from identical random initializations on the same formal-language task and data, so functional equivalence holds by construction; polymorphism appears only in the residual-stream coordinates. For Pythia-70m the pre-registered four-bar framework evaluates activation alignment and artifact transfer rather than end-to-end loss, because the operational goal is transferability of interpretability tools. We nevertheless agree that an explicit comparison of output distributions on held-out inputs would further substantiate the premise and will add this analysis in revision. The close quantitative agreement of R with Haar SO(d_model) (Frobenius norm to 0.1 % and KS p≈1 on the eigenvalue spectrum) supplies independent evidence against substantial non-rotational absorption: any systematic compensation for functional differences would shift the spectrum away from the Haar prediction. revision: partial
Referee: [Steering vector transfer] Steering-vector transfer paragraph: the three-regime claim (clean/partial/inverted depending on overlap with R's invariant subspace) and the collapse to universal inversion under no shared I/O rest on the same unverified functional-equivalence assumption; without an explicit test that the models agree on the underlying computation, the regime distinctions could reflect compensation rather than rotation of an identical function.

Authors: The regime distinctions are derived geometrically from the fitted orthogonal R and the shared output projection in the Dyck-3 model, where functional identity is exact. Clean transfer occurs precisely when a steering vector lies in the subspace fixed by the output weights; inversion occurs when it lies in the rotated complement. In Pythia-70m the absence of shared I/O produces the observed universal inversion, which is an empirical outcome under the same rotation. The consistency of these regimes across nine independent seeds and the exact match of R to Haar statistics indicate that the distinctions arise from the geometry of a shared computation rather than ad-hoc compensation. revision: no
Referee: [Pythia-70m results] Pythia-70m results paragraph: the reported 0.025 EV gap and p≈1 are impressive, but because the central claim equates activation alignment with functional identity, the manuscript must show that the fitted R does not reduce reconstruction error by fitting to non-rotational discrepancies; otherwise the polymorphism interpretation is under-determined by the data.

Authors: Because Procrustes is constrained to produce an orthogonal matrix, it cannot arbitrarily absorb non-rotational variance; it can only rotate the coordinate frame. The decisive evidence that the fitted R captures a pure rotation is its distribution: the Frobenius norm matches the exact Haar expectation to 0.1 % at d_model=512, and the eigenvalue spectrum is statistically indistinguishable from Haar SO(d_model) (pooled and per-pair KS p≈1). Any non-rotational component large enough to explain the 0.025 EV gap would necessarily perturb this spectrum, which is not observed. The gap itself is measured relative to the within-seed reconstruction ceiling after the rotation has been applied. revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external distributional tests

full rationale

The paper's central procedure fits an orthogonal Procrustes matrix R to a batch of activations and reports that the resulting R aligns SAEs and steering vectors while its norm and eigenvalue spectrum match the Haar measure on SO(d_model) to high precision (Frobenius norm within 0.1%, KS p≈1). This match is an independent statistical benchmark, not a quantity defined in terms of the fitted R itself. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked. The premise that models implement identical functions is an empirical hypothesis tested by reconstruction metrics and distributional agreement rather than a definitional reduction. The derivation chain therefore contains no load-bearing step that collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on one fitted object (the Procrustes rotation R) and one domain assumption (identical function across seeds). No new particles or forces are postulated.

free parameters (1)

rotation matrix R
Orthogonal matrix obtained by Procrustes alignment on a batch of residual-stream activations for each model pair; used to define the polymorphism and to perform the transfer.

axioms (1)

domain assumption Independently trained models compute exactly the same function
Invoked when the abstract states that the Procrustes rotation 'removes' the polymorphism and restores reconstruction; if the functions differ beyond a basis change, the fitted R compensates for more than rotation.

pith-pipeline@v0.9.1-grok · 5902 in / 1554 out tokens · 76063 ms · 2026-06-30T14:06:38.909595+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 13 canonical work pages · 8 internal anchors

[1]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Halla- han, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 21 Pythia: A suite for analyzing large language models across training and scaling.arXiv preprint arXiv:2304.01373,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Alan Chen, Jack Merullo, Ellie Pavlick, and Alessandro Stolfo

URL https://transformer-circuits.pub/2023/monosemantic-features/index.html. Alan Chen, Jack Merullo, Ellie Pavlick, and Alessandro Stolfo. Transferring linear features across language models with model stitching.arXiv preprint arXiv:2506.06609,

work page arXiv 2023
[3]

Dunefsky, P

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits.arXiv preprint arXiv:2406.11944,

work page arXiv
[4]

Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur

URL https://transformer-circuits.pub/2021/ framework/index.html. Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. InInternational Conference on Learning Representations,

2021
[5]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Localizing Model Behavior with Path Patching

Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching.arXiv preprint arXiv:2304.05969,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

AtP*: An efficient and scalable method for localizing LLM behaviour to components.arXiv preprint arXiv:2403.00745,

J´ anos Kram´ ar, Tom Lieberum, Rohin Shah, and Neel Nanda. AtP*: An efficient and scalable method for localizing LLM behaviour to components.arXiv preprint arXiv:2403.00745,

work page arXiv
[8]

Sparse autoencoders reveal universal feature spaces across large language models.arXiv preprint arXiv:2410.06981,

Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, and Fazl Barez. Sparse autoencoders reveal universal feature spaces across large language models.arXiv preprint arXiv:2410.06981,

work page arXiv
[9]

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

22 Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, J´ anos Kram´ ar, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2.arXiv preprint arXiv:2408.05147,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Zoom In : An Introduction to Circuits

doi: 10.23915/distill.00024.001. https: //distill.pub/2020/circuits/zoom-in/. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads.Transformer Circuits Thread,

work page doi:10.23915/distill.00024.001 2020
[12]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte Mac- Diarmid. Activation addition: Steering language models without optimization.arXiv preprint arXiv:2308.10248,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

The honest summary: the main findings of the paper assume training and evaluation cover the same length range

would change this. The honest summary: the main findings of the paper assume training and evaluation cover the same length range. B Symmetry-alignment uniqueness For seed 0 self-alignment the algorithm recovers MSE = 1 .3 × 10−10—essentially the fp32 noise floor. For seed N vs seed 0 (Cohort A coordinated training), all 32 multi-start configurations (16 r...

2060

[1] [1]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Halla- han, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 21 Pythia: A suite for analyzing large language models across training and scaling.arXiv preprint arXiv:2304.01373,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Alan Chen, Jack Merullo, Ellie Pavlick, and Alessandro Stolfo

URL https://transformer-circuits.pub/2023/monosemantic-features/index.html. Alan Chen, Jack Merullo, Ellie Pavlick, and Alessandro Stolfo. Transferring linear features across language models with model stitching.arXiv preprint arXiv:2506.06609,

work page arXiv 2023

[3] [3]

Dunefsky, P

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits.arXiv preprint arXiv:2406.11944,

work page arXiv

[4] [4]

Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur

URL https://transformer-circuits.pub/2021/ framework/index.html. Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. InInternational Conference on Learning Representations,

2021

[5] [5]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Localizing Model Behavior with Path Patching

Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching.arXiv preprint arXiv:2304.05969,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

AtP*: An efficient and scalable method for localizing LLM behaviour to components.arXiv preprint arXiv:2403.00745,

J´ anos Kram´ ar, Tom Lieberum, Rohin Shah, and Neel Nanda. AtP*: An efficient and scalable method for localizing LLM behaviour to components.arXiv preprint arXiv:2403.00745,

work page arXiv

[8] [8]

Sparse autoencoders reveal universal feature spaces across large language models.arXiv preprint arXiv:2410.06981,

Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, and Fazl Barez. Sparse autoencoders reveal universal feature spaces across large language models.arXiv preprint arXiv:2410.06981,

work page arXiv

[9] [9]

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

22 Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, J´ anos Kram´ ar, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2.arXiv preprint arXiv:2408.05147,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Zoom In : An Introduction to Circuits

doi: 10.23915/distill.00024.001. https: //distill.pub/2020/circuits/zoom-in/. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads.Transformer Circuits Thread,

work page doi:10.23915/distill.00024.001 2020

[12] [12]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte Mac- Diarmid. Activation addition: Steering language models without optimization.arXiv preprint arXiv:2308.10248,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

The honest summary: the main findings of the paper assume training and evaluation cover the same length range

would change this. The honest summary: the main findings of the paper assume training and evaluation cover the same length range. B Symmetry-alignment uniqueness For seed 0 self-alignment the algorithm recovers MSE = 1 .3 × 10−10—essentially the fp32 noise floor. For seed N vs seed 0 (Cohort A coordinated training), all 32 multi-start configurations (16 r...

2060