Interactions Between Crosscoder Features: A Compact Proofs Perspective

Anna Soligo; Chun-Hei Yip; Dmitry Manning-Coe; Jason Gross; Oliver Clive-Griffin; Rajashree Agrawal; Thomas Read

arxiv: 2606.09940 · v1 · pith:MVIWY3ZBnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI

Dmitry Manning-Coe, Thomas Read, Anna Soligo, Oliver Clive-Griffin, Chun-Hei Yip, Rajashree Agrawal, Jason Gross

Pith reviewed 2026-06-27 17:24 UTC · model grok-4.3

keywords: crosscoders, feature interactions, compact proofs, dictionary learning, sparse autoencoders, MLP performance, sleeper agents

The pith

An error term from compact proofs of crosscoder performance measures feature interactions and serves as a penalty for computational sparsity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a compact proof of model performance can be built from a crosscoder decomposition. An error term in this proof is interpreted as quantifying how features interact, with an explicit form given for MLP layers. Treating this term as a loss penalty produces crosscoders that keep 60 percent of original MLP performance even when forced to use only one feature per datapoint and neuron. This matters because it turns an abstract notion of feature dependence into a practical training signal for sparser and potentially more interpretable representations.

Core claim

The authors construct a compact proof of model performance using a crosscoder and identify an error term that arises in the proof as a natural measure of interactions between crosscoder features. They derive an explicit expression for this interaction term in the case of MLP layers. When this term is used as a differentiable penalty during training, the resulting crosscoders achieve computational sparsity: they retain 60% of MLP performance when only a single feature is kept at each datapoint and neuron, compared to 10% retention for standard crosscoders without the penalty. The same measure also produces semantically meaningful clusters of features and detects substantial interactions within sleeper agent models.
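A toy illustration of what an interaction term of this kind can look like for a single ReLU neuron. This is a sketch under stated assumptions, not the paper's expression, which this page does not reproduce: `contribs` is a hypothetical list of per-feature contributions to one neuron's pre-activation, and `interaction_residual` is an illustrative name. The residual compares the neuron's response to all features together against the sum of its responses to each feature alone.

```python
def relu(z):
    """Standard ReLU nonlinearity."""
    return max(0.0, z)


def interaction_residual(contribs, bias=0.0):
    """Non-additivity of one ReLU neuron across feature contributions.

    contribs: hypothetical per-feature contributions to the neuron's
    pre-activation (an assumption of this sketch, not the paper's notation).
    Returns joint response minus the sum of single-feature responses,
    each measured relative to the bias-only response.
    """
    base = relu(bias)
    joint = relu(sum(contribs) + bias) - base
    separate = sum(relu(c + bias) - base for c in contribs)
    return joint - separate


# Only one feature contributes: the neuron is additive, residual is zero.
one_feature = interaction_residual([1.5, 0.0, 0.0])

# Two features partially cancel before the ReLU: residual is nonzero.
two_features = interaction_residual([2.0, -1.5])
```

In this toy setup with zero bias, contributions that are all positive are also additive; the residual becomes nonzero once features partially cancel or jointly push the neuron across the ReLU kink, which is the kind of dependence a penalty of this type would discourage.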

What carries the argument

The interaction term from the compact proof of model performance, which quantifies pairwise feature interactions in crosscoders and acts as a training penalty.

If this is right

Single-feature crosscoders retain 60% of MLP performance instead of 10% when the interaction penalty is applied.
Features clustered by the interaction measure form semantically coherent groups.
Sleeper agent models display significant levels of feature interaction under this measure.
The interaction term admits an explicit closed-form expression for MLP layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the interaction penalty successfully reduces dependence between features, it may also improve the faithfulness of feature attributions in downstream interpretability tasks.
The approach could be tested on transformer attention layers to check whether the same error-term interpretation holds beyond MLPs.
Measuring interactions might offer a way to audit for coordinated deceptive behaviors beyond the sleeper agent examples already examined.

Load-bearing premise

That the error term in the compact proof can be interpreted as measuring interactions between crosscoder features.

What would settle it

Train a crosscoder with and without the interaction penalty on the same dataset, then compare the retained MLP performance when each is restricted to one feature per datapoint and neuron. If the gap disappears or reverses, the utility of the penalty is falsified.
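The comparison above can be sketched as follows. Everything here is an assumption made for illustration: the page does not give the paper's fidelity definition, so `retained_fraction` uses one common normalization (fraction of the loss gap recovered relative to a zero-ablated baseline), and `keep_top_feature` is a hypothetical reading of "one feature per datapoint and neuron" that keeps the largest-magnitude contribution at each neuron.

```python
def keep_top_feature(contribs):
    """Keep one feature per neuron for a single datapoint.

    contribs[f][n]: hypothetical contribution of feature f to neuron n.
    Returns a same-shaped table in which, for each neuron, every entry
    except the largest-magnitude one is set to zero.
    """
    n_feat = len(contribs)
    n_neur = len(contribs[0])
    kept = [[0.0] * n_neur for _ in range(n_feat)]
    for n in range(n_neur):
        top = max(range(n_feat), key=lambda f: abs(contribs[f][n]))
        kept[top][n] = contribs[top][n]
    return kept


def retained_fraction(loss_clean, loss_restricted, loss_ablated):
    """Fraction of the clean-vs-ablated loss gap that the restricted model recovers."""
    gap = loss_ablated - loss_clean
    if gap <= 0:
        raise ValueError("ablated loss must exceed clean loss")
    return (loss_ablated - loss_restricted) / gap


# Three features, two neurons, one datapoint (made-up numbers).
contribs = [[0.2, -3.0], [1.0, 0.5], [-0.4, 2.0]]
restricted = keep_top_feature(contribs)

# Made-up losses; run once per crosscoder (with and without the penalty).
frac = retained_fraction(loss_clean=2.0, loss_restricted=3.2, loss_ablated=5.0)
```

Running the same two functions on a penalized and an unpenalized crosscoder, with identical data and seeds, gives the two numbers whose gap the test above turns on.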

Figures

Figures reproduced from arXiv: 2606.09940 by Anna Soligo, Chun-Hei Yip, Dmitry Manning-Coe, Jason Gross, Oliver Clive-Griffin, Rajashree Agrawal, Thomas Read.

**Figure 2.** Analysis results for computationally sparse crosscoders. (a) The fidelity for zero-ablating ...

**Figure 3.** The cluster assignment accuracy at different cluster sizes (left) shows a slightly higher ...

**Figure 4.** Example clean and “poisoned” sleeper text evaluated on both base and sleeper models. The ...

**Figure 5.** Pre-softmax query/key attention pattern for an attention head on an example text. Top ...

**Figure 6.** Matrix of query/key feature interaction coefficients between those features active at two ...

**Figure 7.** Ablations for the MLP in each layer of the network. We see that the reconstruction fidelity ...

**Figure 8.** Tradeoffs for Pythia models across crosscoder hidden dimensions and model sizes. Note ...

**Figure 9.** Scaling behavior across different model sizes.

**Figure 10.** The effect of adding the interaction penalty to the modular addition network. The ...

**Figure 11.** The explicit confusion matrix for the auto-interpretability procedure described in the main ...

**Figure 12.** Auto-interpretability explanations from the penalized crosscoder. Top: Examples showing ...

**Figure 13.** Mechanistic Anomaly Detection using the STII.

Original abstract

Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into independent features. Interactions between features hence induce errors in the reconstruction. We formalize this intuition via compact proofs and make five contributions. First, we show how, in principle, a compact proof of model performance can be constructed using a crosscoder. Second, we show that an error term arising in this proof can naturally be interpreted as a measure of interaction between crosscoder features and provide an explicit expression for the interaction term in the Multi-Layer Perceptron (MLP) layers. We then provide three applications of this new interaction measure. In our third contribution we show that the interaction term itself can be used as a differentiable loss penalty. Applying this penalty, we can achieve "computationally sparse" crosscoders that retain 60% of MLP performance when only keeping a single feature at each datapoint and neuron, compared to 10% in standard crosscoders. We then show that clustering according to our interaction measure provides semantically meaningful feature clusters, and finally that sleeper agents have significant interactions. Code is available at https://github.com/chainik1125/crosscoders-feature-interactions/tree/arxiv.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives an interaction penalty for crosscoders from compact proof error and claims it yields much stronger sparsity retention than standard training.

The two things to know are that they extract an explicit interaction term from the error in a compact proof of crosscoder-based model performance, and they turn that term into a differentiable penalty during training. The headline result is that this produces crosscoders retaining 60% of MLP performance under single-feature-per-datapoint-and-neuron sparsity, versus 10% for the baseline.

What is new is the derivation of that interaction expression for MLP layers and its use in three places: the penalty itself, clustering features by interaction strength, and checking interactions inside sleeper agents. The public code is a practical plus.

The soft spot is the load-bearing identification of the proof error term as a clean measure of feature interaction. The abstract calls it natural, but without the full equations it is hard to tell whether the term mixes reconstruction error with other quantities or rests on unstated independence assumptions. If that step does not hold rigorously, the penalty is not guaranteed to target interactions specifically and the 60% figure loses its claimed grounding. The clustering and sleeper-agent sections are lighter and would need separate checks.

This is for people already working on SAEs and crosscoders who want better sparsity tools for auditing. A reader focused on practical interpretability methods would get value from testing the penalty if the derivation checks out.

It deserves serious referee time because the idea is new relative to the cited literature and the empirical claim is concrete enough to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper claims that compact proofs of model performance can be constructed with crosscoders, that an error term in such proofs can be interpreted as a measure of interactions between crosscoder features (with an explicit expression supplied for MLP layers), and that this measure can be used as a differentiable penalty to train computationally sparse crosscoders. The central empirical claim is that the resulting models retain 60% of MLP performance under single-feature-per-datapoint-and-neuron sparsity, versus 10% for standard crosscoders; additional applications are shown for clustering and sleeper-agent analysis.

Significance. If the identification of the proof error term with a specific interaction measure holds rigorously, the work supplies a new, derivation-grounded penalty for dictionary learning that directly targets feature interactions rather than relying on post-hoc heuristics. The reported sparsity result would then constitute a concrete, falsifiable improvement over baseline crosscoders, and the clustering and sleeper-agent findings would offer testable predictions about feature structure.

major comments (2)

[interaction-term derivation (explicit MLP expression)] The load-bearing step is the identification, in the section deriving the interaction term for MLP layers, of the compact-proof error term with a measure of feature interactions. The manuscript must show explicitly (via the supplied expression) that this term isolates pairwise or higher-order interactions and does not conflate them with residual reconstruction error or with the choice of feature basis; without that separation the subsequent penalty is not guaranteed to penalize interactions specifically.
[empirical sparsity experiment] § on the 60%-versus-10% experiment: the performance numbers are obtained after adding the interaction penalty; the paper should report an ablation that replaces the derived term with a generic reconstruction-error penalty of matched magnitude to confirm that the gain is attributable to the interaction interpretation rather than to any differentiable sparsity regularizer.

minor comments (2)

The abstract states that code is available at a GitHub link; the repository should be checked for a reproducible script that exactly reproduces the 60% figure from the same random seed and data split used in the paper.
Notation for the interaction term should be introduced once with a clear symbol (e.g., I_{ij}) and then used consistently; currently the transition from the proof error to the penalty appears to reuse the same symbol without redefinition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [interaction-term derivation (explicit MLP expression)] The load-bearing step is the identification, in the section deriving the interaction term for MLP layers, of the compact-proof error term with a measure of feature interactions. The manuscript must show explicitly (via the supplied expression) that this term isolates pairwise or higher-order interactions and does not conflate them with residual reconstruction error or with the choice of feature basis; without that separation the subsequent penalty is not guaranteed to penalize interactions specifically.

Authors: The explicit expression for the MLP interaction term is obtained by expanding the compact-proof reconstruction error and collecting all cross-feature terms; by algebraic construction these terms vanish exactly when features act independently and are orthogonal to the per-feature reconstruction residuals. The derivation holds for an arbitrary feature basis because it follows from the definition of the proof error rather than from any particular choice of dictionary. To address the concern directly we will insert a dedicated subsection that (i) writes out the full expansion, (ii) demonstrates that the interaction component is zero under additive feature behavior, and (iii) confirms invariance to linear reparameterizations of the basis. revision: partial
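
Point (ii) can be illustrated with a generic non-additivity residual. The form below is an assumption chosen for illustration, not the paper's expression: $f$ stands for any layer map, $a_1,\dots,a_k$ for hypothetical per-feature inputs, and $I_f$ for the residual between the joint response and the sum of single-feature responses.

```latex
% Illustrative residual (assumed form, not taken from the paper).
\[
  I_f(a_1,\dots,a_k)
  \;=\; f\Big(\sum_{i=1}^{k} a_i\Big) \;-\; \sum_{i=1}^{k} f(a_i) \;+\; (k-1)\,f(0).
\]
% For an affine map f(x) = Wx + b every term is linear in the a_i up to b:
\[
  I_f
  \;=\; \Big(W\sum_{i} a_i + b\Big) \;-\; \Big(W\sum_{i} a_i + k\,b\Big) \;+\; (k-1)\,b
  \;=\; 0 .
\]
```

Under this reading, a nonzero residual can only come from the nonlinearity acting on several features at once, which is the property the promised subsection would need to establish for the actual term.
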
Referee: [empirical sparsity experiment] § on the 60%-versus-10% experiment: the performance numbers are obtained after adding the interaction penalty; the paper should report an ablation that replaces the derived term with a generic reconstruction-error penalty of matched magnitude to confirm that the gain is attributable to the interaction interpretation rather than to any differentiable sparsity regularizer.

Authors: We agree that an ablation isolating the effect of the derived interaction term versus a generic reconstruction penalty is necessary to support the claim. In the revised manuscript we will add this controlled comparison, training otherwise identical crosscoders with a reconstruction-error penalty scaled to the same average magnitude as the interaction penalty and reporting the resulting single-feature performance under the same sparsity regime. revision: yes
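
A minimal sketch of the magnitude matching described in this response, under stated assumptions: the function name `matched_coefficient` and the sample values are hypothetical, and "same average magnitude" is read here as equal mean penalty value over a set of reference batches.

```python
def matched_coefficient(interaction_values, generic_values):
    """Coefficient that rescales a generic penalty to the interaction penalty's mean.

    Both arguments are per-batch penalty values (hypothetical measurements
    taken before training the ablation run).
    """
    mean_interaction = sum(interaction_values) / len(interaction_values)
    mean_generic = sum(generic_values) / len(generic_values)
    if mean_generic == 0:
        raise ValueError("generic penalty has zero mean; cannot rescale")
    return mean_interaction / mean_generic


# Made-up per-batch values for illustration only.
coef = matched_coefficient([0.8, 1.2, 1.0], [4.0, 6.0, 5.0])
# The control run would then add coef * generic_penalty to its loss.
```

Matching on the mean is only one option; matching gradient norms instead would control for how strongly each penalty actually moves the weights.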

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper begins with the standard intuition that feature interactions induce reconstruction errors, then constructs a compact proof of model performance using a crosscoder and identifies an error term within that proof. This term is presented as interpretable as an interaction measure (with an explicit MLP expression supplied), after which it is applied as a penalty. No equations are shown that reduce the interaction measure to a fitted parameter or prior result by definition, and no self-citations are invoked as load-bearing premises. The empirical sparsity result follows directly from optimizing the derived penalty term rather than from any renaming, ansatz smuggling, or self-referential fitting. The chain remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; the interaction term itself is presented as derived rather than postulated.

Reference graph

Works this paper leans on

39 extracted references · 2 canonical work pages

[1] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373

[2] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023

[3] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and C...

[4] Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders, 2024. URL https://arxiv.org/abs/2412.06410

[5] Paul Christiano. Mechanistic anomaly detection and elk. https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk, November 2022. AI Alignment Forum post

[6] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. URL https://arxiv.org/abs/2309.08600

[7] David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum. Towards guaranteed safe ai: A framework for ensuring robust and reliable ai systems...

[8] Kedar Dhamdhere, Ashish Agarwal, and Mukund Sundararajan. The shapley taylor interaction index, 2020. URL https://arxiv.org/abs/1902.05622

[9] Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits, 2024. URL https://arxiv.org/abs/2406.11944

[10] Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English?, 2023. URL https://arxiv.org/abs/2305.07759

[11] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

[12] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Benjamin Mann, Amanda Askell, Stephanie Lin, Adam Scherlis, Nova DasSarma, Sam McCandlish, Dario Amodei, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread (Distill), 2021b. URL https://transformer-circuits.pub/2021/framework/index.html

[13] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

[14] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972-976, 2007. doi:10.1126/science.1136800. URL https://www.science.org/doi/10.1126/science.1136800

[15] Alex Gibson. Positional kernels of attention heads. LessWrong blog post, 2025. URL https://www.lesswrong.com/posts/9paB7YhxzsrBoXN8L/positional-kernels-of-attention-heads. Published March 10, 2025

[16] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28(4):547-565, Nov 1999. ISSN 1432-1270. doi:10.1007/s001820050125. URL https://doi.org/10.1007/s001820050125

[17] Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models, 202...

[18] Andrey Gromov. Grokking modular arithmetic, 2023. URL https://arxiv.org/abs/2301.02679

[19] Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, and Lawrence Chan. Compact proofs of model performance via mechanistic interpretability, 2024. URL https://arxiv.org/abs/2406.11779

[20] Stefan Heimersheim. You can remove gpt2's layernorm by fine-tuning, 2024. URL https://arxiv.org/abs/2409.13710

[21] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...

[22] Erik Jenner. A gentle introduction to mechanistic anomaly detection. https://www.lesswrong.com/posts/n7DFwtJvCzkuKmtbG/a-gentle-introduction-to-mechanistic-anomaly-detection, April 2024. LessWrong post

[23] David O. Johnston, Arkajyoti Chakraborty, and Nora Belrose. Mechanistic anomaly detection for "quirky" language models, 2025. URL https://arxiv.org/abs/2504.08812

[24] Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah. Sparse crosscoders for cross-layer features and model diffing. https://transformer-circuits.pub/2024/crosscoders/index.html, October 2024a. Transformer Circuits research update

[25] Jack Lindsey, Adly Templeton, Jonathan Marcus, Tom Conerly, Joshua Batson, and Chris Olah. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024b. URL https://transformer-circuits.pub/2024/crosscoders/index.html

[26] Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2025. URL https://arxiv.org/abs/2403.19647

[27] Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, and Neel Nanda. Robustly identifying concepts introduced during chat fine-tuning using crosscoders. arXiv preprint arXiv:2504.02922, 2025

[28] Maximilian Muschalik, Hubert Baniecki, Fabian Fumagalli, Patrick Kolpaczki, Barbara Hammer, and Eyke Hüllermeier. shapiq: Shapley interactions for machine learning. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=knxGmi6SJi

[29] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. URL https://arxiv.org/abs/2301.05217

[30] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill 5(3): e00024.001, 2020. URL https://distill.pub/2020/circuits/zoom-in/

[31] Gonçalo Paulo, Stepan Shabalin, and Nora Belrose. Transcoders beat sparse autoencoders for interpretability, 2025. URL https://arxiv.org/abs/2501.18823

[32] Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders, 2024. URL https://arxiv.org/abs/2404.16014

[33] Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry. Towards verified artificial intelligence, 2020. URL https://arxiv.org/abs/1606.08514

[34] Anna Soligo, Thomas Read, Oliver Clive-Griffin, Dmitry Manning-Coe, Chun-Hei Yip, Rajashree Agrawal, and Jason Gross. [replication] crosscoder-based stage-wise model diffing. AI Alignment Forum, 2025. https://www.alignmentforum.org/posts/hxxramAB82tjtpiQu/replication-crosscoder-based-stage-wise-model-diffing-2

[35] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

[36] Che-Ping Tsai, Chih-Kuan Yeh, and Pradeep Ravikumar. Faith-shap: The faithful shapley interaction index, 2023. URL https://arxiv.org/abs/2203.00870

[37] Michael Tsang, Sirisha Rambhatla, and Yan Liu. How does this interaction affect me? interpretable attribution for feature interactions. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6147-6159. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc...

[38] Wilson Wu, Louis Jaburi, Jacob Drori, and Jason Gross. Towards a unified and verified understanding of group-operation networks, 2025. URL https://arxiv.org/abs/2410.07476

[39] Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, and Jason Gross. Modular addition without black-boxes: Compressing explanations of mlps that compute numerical integration, 2024. URL https://arxiv.org/abs/2412.03773

[1] [1]

Pythia: A suite for analyzing large language models across training and scaling, 2023

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373

Pith/arXiv arXiv 2023

[2] [2]

Language models can explain neurons in language models

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023

2023

[3] [3]

Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and C...

2023

[4] [4]

B atch T op K sparse autoencoders, 2024

Bart Bussmann, Patrick Leask, and Neel Nanda. B atch T op K sparse autoencoders, 2024. URL https://arxiv.org/abs/2412.06410

arXiv 2024

[5] [5]

Mechanistic anomaly detection and elk

Paul Christiano. Mechanistic anomaly detection and elk. https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk, November 2022. AI Alignment Forum post

2022

[6] [6]

Sparse autoencoders find highly interpretable features in language models, 2023

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. URL https://arxiv.org/abs/2309.08600

Pith/arXiv arXiv 2023

[7] [7]

Towards guaranteed safe ai: A framework for ensuring robust and reliable ai systems, 2024

David ``davidad'' Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum. Towards guaranteed safe ai: A framework for ensuring robust and reliable ai systems...

arXiv 2024

[8] [8]

The shapley taylor interaction index, 2020

Kedar Dhamdhere, Ashish Agarwal, and Mukund Sundararajan. The shapley taylor interaction index, 2020. URL https://arxiv.org/abs/1902.05622

arXiv 2020

[9] [9]

Transcoders find interpretable llm feature circuits, 2024

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits, 2024. URL https://arxiv.org/abs/2406.11944

arXiv 2024

[10] [10]

T iny S tories: How small can language models be and still speak coherent E nglish?, 2023

Ronen Eldan and Yuanzhi Li. T iny S tories: How small can language models be and still speak coherent E nglish?, 2023. URL https://arxiv.org/abs/2305.07759

Pith/arXiv arXiv 2023

[11] [11]

A mathematical framework for transformer circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

2021

[12] [12]

A mathematical framework for transformer circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Benjamin Mann, Amanda Askell, Stephanie Lin, Adam Scherlis, Nova DasSarma, Sam McCandlish, Dario Amodei, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread (Distill), 2021 b . URL: https://transformer-circuits.pub/2021/framework/index.html

2021

[13]

Toy models of superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

Pith/arXiv arXiv 2022

[14]

Clustering by passing messages between data points

Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972--976, 2007. doi:10.1126/science.1136800. URL https://www.science.org/doi/10.1126/science.1136800

work page doi:10.1126/science.1136800 2007

[15]

Positional kernels of attention heads

Alex Gibson. Positional kernels of attention heads. LessWrong blog post, 2025. URL https://www.lesswrong.com/posts/9paB7YhxzsrBoXN8L/positional-kernels-of-attention-heads. Published March 10, 2025

2025

[16]

An axiomatic approach to the concept of interaction among players in cooperative games

Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28(4):547--565, Nov 1999. ISSN 1432-1270. doi:10.1007/s001820050125. URL https://doi.org/10.1007/s001820050125

work page doi:10.1007/s001820050125 1999

[17]

Alignment faking in large language models

Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models, 202...

Pith/arXiv arXiv 2024

[18]

Grokking modular arithmetic, 2023

Andrey Gromov. Grokking modular arithmetic, 2023. URL https://arxiv.org/abs/2301.02679

arXiv 2023

[19]

Compact proofs of model performance via mechanistic interpretability, 2024

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, and Lawrence Chan. Compact proofs of model performance via mechanistic interpretability, 2024. URL https://arxiv.org/abs/2406.11779

arXiv 2024

[20]

You can remove gpt2's layernorm by fine-tuning, 2024

Stefan Heimersheim. You can remove gpt2's layernorm by fine-tuning, 2024. URL https://arxiv.org/abs/2409.13710

arXiv 2024

[21]

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...

Pith/arXiv arXiv 2024

[22]

A gentle introduction to mechanistic anomaly detection

Erik Jenner. A gentle introduction to mechanistic anomaly detection. https://www.lesswrong.com/posts/n7DFwtJvCzkuKmtbG/a-gentle-introduction-to-mechanistic-anomaly-detection, April 2024. LessWrong post

2024

[23]

Mechanistic anomaly detection for "quirky" language models, 2025

David O. Johnston, Arkajyoti Chakraborty, and Nora Belrose. Mechanistic anomaly detection for "quirky" language models, 2025. URL https://arxiv.org/abs/2504.08812

arXiv 2025

[24]

Sparse crosscoders for cross-layer features and model diffing

Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah. Sparse crosscoders for cross-layer features and model diffing. https://transformer-circuits.pub/2024/crosscoders/index.html, October 2024a. Transformer Circuits research update

2024

[25]

Sparse crosscoders for cross-layer features and model diffing

Jack Lindsey, Adly Templeton, Jonathan Marcus, Tom Conerly, Joshua Batson, and Chris Olah. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024b. URL https://transformer-circuits.pub/2024/crosscoders/index.html

2024

[26]

Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2025

Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2025. URL https://arxiv.org/abs/2403.19647

Pith/arXiv arXiv 2025

[27]

Robustly identifying concepts introduced during chat fine-tuning using crosscoders

Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, and Neel Nanda. Robustly identifying concepts introduced during chat fine-tuning using crosscoders. arXiv preprint arXiv:2504.02922, 2025

arXiv 2025

[28]

shapiq: Shapley interactions for machine learning

Maximilian Muschalik, Hubert Baniecki, Fabian Fumagalli, Patrick Kolpaczki, Barbara Hammer, and Eyke Hüllermeier. shapiq: Shapley interactions for machine learning. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=knxGmi6SJi

2024

[29]

Progress measures for grokking via mechanistic interpretability, 2023

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. URL https://arxiv.org/abs/2301.05217

Pith/arXiv arXiv 2023

[30]

Zoom in: An introduction to circuits

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill 5(3): e00024.001, 2020. URL https://distill.pub/2020/circuits/zoom-in/

2020

[31]

Transcoders beat sparse autoencoders for interpretability, 2025

Gonçalo Paulo, Stepan Shabalin, and Nora Belrose. Transcoders beat sparse autoencoders for interpretability, 2025. URL https://arxiv.org/abs/2501.18823

arXiv 2025

[32]

Improving dictionary learning with gated sparse autoencoders, 2024

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders, 2024. URL https://arxiv.org/abs/2404.16014

Pith/arXiv arXiv 2024

[33]

Towards verified artificial intelligence, 2020

Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry. Towards verified artificial intelligence, 2020. URL https://arxiv.org/abs/1606.08514

arXiv 2020

[34]

[replication] crosscoder-based stage-wise model diffing

Anna Soligo, Thomas Read, Oliver Clive-Griffin, Dmitry Manning-Coe, Chun-Hei Yip, Rajashree Agrawal, and Jason Gross. [replication] crosscoder-based stage-wise model diffing. AI Alignment Forum, 2025. https://www.alignmentforum.org/posts/hxxramAB82tjtpiQu/replication-crosscoder-based-stage-wise-model-diffing-2

2025

[35]

Daniel Freeman, Theodore R

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

2024

[36]

Faith-shap: The faithful shapley interaction index, 2023

Che-Ping Tsai, Chih-Kuan Yeh, and Pradeep Ravikumar. Faith-shap: The faithful shapley interaction index, 2023. URL https://arxiv.org/abs/2203.00870

arXiv 2023

[37]

How does this interaction affect me? interpretable attribution for feature interactions

Michael Tsang, Sirisha Rambhatla, and Yan Liu. How does this interaction affect me? interpretable attribution for feature interactions. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6147--6159. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc...

2020

[38]

Towards a unified and verified understanding of group-operation networks, 2025

Wilson Wu, Louis Jaburi, Jacob Drori, and Jason Gross. Towards a unified and verified understanding of group-operation networks, 2025. URL https://arxiv.org/abs/2410.07476

arXiv 2025

[39]

Modular addition without black-boxes: Compressing explanations of mlps that compute numerical integration, 2024

Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, and Jason Gross. Modular addition without black-boxes: Compressing explanations of mlps that compute numerical integration, 2024. URL https://arxiv.org/abs/2412.03773

arXiv 2024