Interactions Between Crosscoder Features: A Compact Proofs Perspective
Pith reviewed 2026-06-27 17:24 UTC · model grok-4.3
The pith
An error term from compact proofs of crosscoder performance measures feature interactions and serves as a penalty for computational sparsity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors construct a compact proof of model performance using a crosscoder and identify an error term that arises in the proof as a natural measure of interactions between crosscoder features. They derive an explicit expression for this interaction term in the case of MLP layers. When this term is used as a differentiable penalty during training, the resulting crosscoders achieve computational sparsity: they retain 60% of MLP performance when only a single feature is kept at each datapoint and neuron, compared to 10% retention for standard crosscoders without the penalty. The same measure also produces semantically meaningful clusters of features and detects substantial interactions withi
What carries the argument
The interaction term from the compact proof of model performance, which quantifies pairwise feature interactions in crosscoders and acts as a training penalty.
If this is right
- Single-feature crosscoders retain 60% of MLP performance instead of 10% when the interaction penalty is applied.
- Features clustered by the interaction measure form semantically coherent groups.
- Sleeper agent models display significant levels of feature interaction under this measure.
- The interaction term admits an explicit closed-form expression for MLP layers.
Where Pith is reading between the lines
- If the interaction penalty successfully reduces dependence between features, it may also improve the faithfulness of feature attributions in downstream interpretability tasks.
- The approach could be tested on transformer attention layers to check whether the same error-term interpretation holds beyond MLPs.
- Measuring interactions might offer a way to audit for coordinated deceptive behaviors beyond the sleeper agent examples already examined.
Load-bearing premise
That the error term in the compact proof can be interpreted as measuring interactions between crosscoder features.
What would settle it
Training a crosscoder both with and without the interaction penalty on the same dataset and comparing the retained MLP performance when restricting to one feature per datapoint; if the gap disappears or reverses, the utility of the penalty is falsified.
Figures
read the original abstract
Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into independent features. Interactions between features hence induce errors in the reconstruction. We formalize this intuition via compact proofs and make five contributions. First, we show how, \textit{in principle}, a compact proof of model performance can be constructed using a crosscoder. Second, we show that an error term arising in this proof can naturally be interpreted as a measure of interaction between crosscoder features and provide an explicit expression for the interaction term in the Multi-Layer Perceptron (MLP) layers. We then provide three applications of this new interaction measure. In our third contribution we show that the interaction term itself can be used as a differentiable loss penalty. Applying this penalty, we can achieve ``computationally sparse'' crosscoders that retain $60\%$ of MLP performance when only keeping a single feature at each datapoint and neuron, compared to $10\%$ in standard crosscoders. We then show that clustering according to our interaction measure provides semantically meaningful feature clusters, and finally that sleeper agents have significant interactions. Code is available at https://github.com/chainik1125/crosscoders-feature-interactions/tree/arxiv.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that compact proofs of model performance can be constructed with crosscoders, that an error term in such proofs can be interpreted as a measure of interactions between crosscoder features (with an explicit expression supplied for MLP layers), and that this measure can be used as a differentiable penalty to train computationally sparse crosscoders. The central empirical claim is that the resulting models retain 60% of MLP performance under single-feature-per-datapoint-and-neuron sparsity, versus 10% for standard crosscoders; additional applications are shown for clustering and sleeper-agent analysis.
Significance. If the identification of the proof error term with a specific interaction measure holds rigorously, the work supplies a new, derivation-grounded penalty for dictionary learning that directly targets feature interactions rather than relying on post-hoc heuristics. The reported sparsity result would then constitute a concrete, falsifiable improvement over baseline crosscoders, and the clustering and sleeper-agent findings would offer testable predictions about feature structure.
major comments (2)
- [interaction-term derivation (explicit MLP expression)] The load-bearing step is the identification, in the section deriving the interaction term for MLP layers, of the compact-proof error term with a measure of feature interactions. The manuscript must show explicitly (via the supplied expression) that this term isolates pairwise or higher-order interactions and does not conflate them with residual reconstruction error or with the choice of feature basis; without that separation the subsequent penalty is not guaranteed to penalize interactions specifically.
- [empirical sparsity experiment] § on the 60%-versus-10% experiment: the performance numbers are obtained after adding the interaction penalty; the paper should report an ablation that replaces the derived term with a generic reconstruction-error penalty of matched magnitude to confirm that the gain is attributable to the interaction interpretation rather than to any differentiable sparsity regularizer.
minor comments (2)
- The abstract states that code is available at a GitHub link; the repository should be checked for a reproducible script that exactly reproduces the 60% figure from the same random seed and data split used in the paper.
- Notation for the interaction term should be introduced once with a clear symbol (e.g., I_{ij}) and then used consistently; currently the transition from the proof error to the penalty appears to reuse the same symbol without redefinition.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [interaction-term derivation (explicit MLP expression)] The load-bearing step is the identification, in the section deriving the interaction term for MLP layers, of the compact-proof error term with a measure of feature interactions. The manuscript must show explicitly (via the supplied expression) that this term isolates pairwise or higher-order interactions and does not conflate them with residual reconstruction error or with the choice of feature basis; without that separation the subsequent penalty is not guaranteed to penalize interactions specifically.
Authors: The explicit expression for the MLP interaction term is obtained by expanding the compact-proof reconstruction error and collecting all cross-feature terms; by algebraic construction these terms vanish exactly when features act independently and are orthogonal to the per-feature reconstruction residuals. The derivation holds for an arbitrary feature basis because it follows from the definition of the proof error rather than from any particular choice of dictionary. To address the concern directly we will insert a dedicated subsection that (i) writes out the full expansion, (ii) demonstrates that the interaction component is zero under additive feature behavior, and (iii) confirms invariance to linear reparameterizations of the basis. revision: partial
-
Referee: [empirical sparsity experiment] § on the 60%-versus-10% experiment: the performance numbers are obtained after adding the interaction penalty; the paper should report an ablation that replaces the derived term with a generic reconstruction-error penalty of matched magnitude to confirm that the gain is attributable to the interaction interpretation rather than to any differentiable sparsity regularizer.
Authors: We agree that an ablation isolating the effect of the derived interaction term versus a generic reconstruction penalty is necessary to support the claim. In the revised manuscript we will add this controlled comparison, training otherwise identical crosscoders with a reconstruction-error penalty scaled to the same average magnitude as the interaction penalty and reporting the resulting single-feature performance under the same sparsity regime. revision: yes
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper begins with the standard intuition that feature interactions induce reconstruction errors, then constructs a compact proof of model performance using a crosscoder and identifies an error term within that proof. This term is presented as interpretable as an interaction measure (with an explicit MLP expression supplied), after which it is applied as a penalty. No equations are shown that reduce the interaction measure to a fitted parameter or prior result by definition, and no self-citations are invoked as load-bearing premises. The empirical sparsity result follows directly from optimizing the derived penalty term rather than from any renaming, ansatz smuggling, or self-referential fitting. The chain remains independent of its inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Pythia: A suite for analyzing large language models across training and scaling, 2023
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373
Pith/arXiv arXiv 2023
-
[2]
Language models can explain neurons in language models
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023
2023
-
[3]
Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and C...
2023
-
[4]
B atch T op K sparse autoencoders, 2024
Bart Bussmann, Patrick Leask, and Neel Nanda. B atch T op K sparse autoencoders, 2024. URL https://arxiv.org/abs/2412.06410
arXiv 2024
-
[5]
Mechanistic anomaly detection and elk
Paul Christiano. Mechanistic anomaly detection and elk. https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk, November 2022. AI Alignment Forum post
2022
-
[6]
Sparse autoencoders find highly interpretable features in language models, 2023
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. URL https://arxiv.org/abs/2309.08600
Pith/arXiv arXiv 2023
-
[7]
Towards guaranteed safe ai: A framework for ensuring robust and reliable ai systems, 2024
David ``davidad'' Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum. Towards guaranteed safe ai: A framework for ensuring robust and reliable ai systems...
arXiv 2024
-
[8]
The shapley taylor interaction index, 2020
Kedar Dhamdhere, Ashish Agarwal, and Mukund Sundararajan. The shapley taylor interaction index, 2020. URL https://arxiv.org/abs/1902.05622
arXiv 2020
-
[9]
Transcoders find interpretable llm feature circuits, 2024
Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits, 2024. URL https://arxiv.org/abs/2406.11944
arXiv 2024
-
[10]
T iny S tories: How small can language models be and still speak coherent E nglish?, 2023
Ronen Eldan and Yuanzhi Li. T iny S tories: How small can language models be and still speak coherent E nglish?, 2023. URL https://arxiv.org/abs/2305.07759
Pith/arXiv arXiv 2023
-
[11]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...
2021
-
[12]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Benjamin Mann, Amanda Askell, Stephanie Lin, Adam Scherlis, Nova DasSarma, Sam McCandlish, Dario Amodei, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread (Distill), 2021 b . URL: https://transformer-circuits.pub/2021/framework/index.html
2021
-
[13]
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022
Pith/arXiv arXiv 2022
-
[14]
Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315 0 (5814): 0 972--976, 2007. doi:10.1126/science.1136800. URL https://www.science.org/doi/10.1126/science.1136800
-
[15]
Positional kernels of attention heads
Alex Gibson. Positional kernels of attention heads. LessWrong blog post, 2025. URL https://www.lesswrong.com/posts/9paB7YhxzsrBoXN8L/positional-kernels-of-attention-heads. Published March 10, 2025
2025
-
[16]
Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28 0 (4): 0 547--565, nov 1999. ISSN 1432-1270. doi:10.1007/s001820050125. URL https://doi.org/10.1007/s001820050125
-
[17]
Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models, 202...
Pith/arXiv arXiv 2024
-
[18]
Grokking modular arithmetic, 2023
Andrey Gromov. Grokking modular arithmetic, 2023. URL https://arxiv.org/abs/2301.02679
arXiv 2023
-
[19]
Compact proofs of model performance via mechanistic interpretability, 2024
Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, and Lawrence Chan. Compact proofs of model performance via mechanistic interpretability, 2024. URL https://arxiv.org/abs/2406.11779
arXiv 2024
-
[20]
You can remove gpt2's layernorm by fine-tuning, 2024
Stefan Heimersheim. You can remove gpt2's layernorm by fine-tuning, 2024. URL https://arxiv.org/abs/2409.13710
arXiv 2024
-
[21]
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...
Pith/arXiv arXiv 2024
-
[22]
A gentle introduction to mechanistic anomaly detection
Erik Jenner. A gentle introduction to mechanistic anomaly detection. https://www.lesswrong.com/posts/n7DFwtJvCzkuKmtbG/a-gentle-introduction-to-mechanistic-anomaly-detection, April 2024. LessWrong post
2024
-
[23]
Johnston, Arkajyoti Chakraborty, and Nora Belrose
David O. Johnston, Arkajyoti Chakraborty, and Nora Belrose. Mechanistic anomaly detection for "quirky" language models, 2025. URL https://arxiv.org/abs/2504.08812
arXiv 2025
-
[24]
Sparse crosscoders for cross-layer features and model diffing
Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah. Sparse crosscoders for cross-layer features and model diffing. https://transformer-circuits.pub/2024/crosscoders/index.html, October 2024 a . Transformer Circuits research update
2024
-
[25]
Sparse crosscoders for cross-layer features and model diffing
Jack Lindsey, Adly Templeton, Jonathan Marcus, Tom Conerly, Joshua Baston, and Chris Olah. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024 b . URL https://transformer-circuits.pub/2024/crosscoders/index.html
2024
-
[26]
Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2025. URL https://arxiv.org/abs/2403.19647
Pith/arXiv arXiv 2025
-
[27]
Robustly identifying concepts introduced during chat fine-tuning using crosscoders
Julian Minder, Cl \'e ment Dumas, Caden Juang, Bilal Chugtai, and Neel Nanda. Robustly identifying concepts introduced during chat fine-tuning using crosscoders. arXiv preprint arXiv:2504.02922, 2025
arXiv 2025
-
[28]
shapiq: Shapley interactions for machine learning
Maximilian Muschalik, Hubert Baniecki, Fabian Fumagalli, Patrick Kolpaczki, Barbara Hammer, and Eyke H\" u llermeier. shapiq: Shapley interactions for machine learning. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=knxGmi6SJi
2024
-
[29]
Progress measures for grokking via mechanistic interpretability, 2023
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. URL https://arxiv.org/abs/2301.05217
Pith/arXiv arXiv 2023
-
[30]
Zoom in: An introduction to circuits
Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill 5(3): e00024.001, 2020. URL https://distill.pub/2020/circuits/zoom-in/
2020
-
[31]
Transcoders beat sparse autoencoders for interpretability, 2025
Gonçalo Paulo, Stepan Shabalin, and Nora Belrose. Transcoders beat sparse autoencoders for interpretability, 2025. URL https://arxiv.org/abs/2501.18823
arXiv 2025
-
[32]
Improving dictionary learning with gated sparse autoencoders, 2024
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders, 2024. URL https://arxiv.org/abs/2404.16014
Pith/arXiv arXiv 2024
-
[33]
Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry. Towards verified artificial intelligence, 2020. URL https://arxiv.org/abs/1606.08514
arXiv 2020
-
[34]
[replication] crosscoder-based stage-wise model diffing
Anna Soligo, Thomas Read, Oliver Clive-Griffin, Dmitry Manning-Coe, Chun-Hei Yip, Rajashree Agrawal, and Jason Gross. [replication] crosscoder-based stage-wise model diffing. AI Alignment Forum, 2025. https://www.alignmentforum.org/posts/hxxramAB82tjtpiQu/replication-crosscoder-based-stage-wise-model-diffing-2
2025
-
[35]
Daniel Freeman, Theodore R
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...
2024
-
[36]
Faith-shap: The faithful shapley interaction index, 2023
Che-Ping Tsai, Chih-Kuan Yeh, and Pradeep Ravikumar. Faith-shap: The faithful shapley interaction index, 2023. URL https://arxiv.org/abs/2203.00870
arXiv 2023
-
[37]
How does this interaction affect me? interpretable attribution for feature interactions
Michael Tsang, Sirisha Rambhatla, and Yan Liu. How does this interaction affect me? interpretable attribution for feature interactions. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6147--6159. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc...
2020
-
[38]
Towards a unified and verified understanding of group-operation networks, 2025
Wilson Wu, Louis Jaburi, Jacob Drori, and Jason Gross. Towards a unified and verified understanding of group-operation networks, 2025. URL https://arxiv.org/abs/2410.07476
arXiv 2025
-
[39]
Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, and Jason Gross. Modular addition without black-boxes: Compressing explanations of mlps that compute numerical integration, 2024. URL https://arxiv.org/abs/2412.03773
arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.