Are Sparse Autoencoder Benchmarks Reliable?

David Chanin

arxiv: 2605.18229 · v1 · pith:GTZ44TTJnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 12:39 UTC · model grok-4.3

keywords sparse autoencoders · SAE evaluation · interpretability metrics · benchmark reliability · language model features · metric auditing

The pith

Two common metrics for measuring sparse autoencoder quality produce inconsistent results and should be dropped from evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits the standard benchmarks used to compare sparse autoencoders for language model interpretability. It applies three tests to the metrics: how much scores change when the same autoencoder is retrained with a new random seed, how well scores track known ground-truth features in synthetic data, and how well scores distinguish autoencoders trained along different trajectories. Two metrics, targeted probe perturbation and spurious correlation removal, fail multiple tests at their usual settings. The remaining metrics show more noise and weaker separation power than commonly assumed, though one variant of probing performs better than the rest.

Core claim

Auditing the SAEBench metrics through reseed noise on fixed SAEs, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories shows that targeted probe perturbation and spurious correlation removal fail at canonical settings and should not be used, while other metrics exhibit higher reseed noise and lower discriminability than assumed, with the sae-probes variant of k-sparse probing emerging as the most reliable option tested.

What carries the argument

Three auditing lenses (reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories) applied to the SAEBench metrics.
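The first and third lenses can be illustrated with a small sketch. This is a hedged illustration, not the paper's implementation: the function names, the made-up scores, and the particular signal-to-noise definition (spread of per-SAE mean scores divided by the mean reseed standard deviation) are all assumptions chosen for clarity.

```python
from statistics import mean, stdev


def reseed_noise(scores_by_seed):
    """Sample standard deviation of one metric on one fixed SAE across reseeds."""
    return stdev(scores_by_seed)


def signal_to_noise(scores_by_sae):
    """Spread of per-SAE mean scores divided by the average reseed noise.

    `scores_by_sae` maps a (hypothetical) SAE name to that SAE's scores
    across evaluation reseeds. This ratio is an assumed, illustrative
    definition of discriminability, not necessarily the one used in the paper.
    """
    per_sae_means = [mean(scores) for scores in scores_by_sae.values()]
    per_sae_noise = [stdev(scores) for scores in scores_by_sae.values()]
    return stdev(per_sae_means) / mean(per_sae_noise)


# Made-up scores, for illustration only.
example_scores = {
    "sae_a": [0.80, 0.82, 0.81],
    "sae_b": [0.70, 0.72, 0.71],
}
```

A ratio well above 1 means the gap between SAEs is large relative to reseed noise; a ratio near or below 1 means a single evaluation run cannot reliably rank the SAEs.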

If this is right

Evaluations that relied on targeted probe perturbation or spurious correlation removal may have mis-ranked sparse autoencoders.
The sae-probes variant should be preferred over the other metrics when comparing current sparse autoencoder designs.
Progress on sparse autoencoder architectures will require new evaluation methods that separate models more cleanly.
Existing published comparisons of sparse autoencoders should be re-examined for metric-induced errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Without reliable metrics, claims about which sparse autoencoder architecture best captures model features rest on shaky ground.
The field could test whether combining the surviving metrics with new synthetic datasets improves separation between similar architectures.
If the same noise patterns appear in other interpretability tools, similar auditing procedures might reveal hidden weaknesses there as well.

Load-bearing premise

The three chosen lenses are enough to decide whether a metric can be trusted when evaluating real sparse autoencoders.

What would settle it

A controlled experiment in which targeted probe perturbation or spurious correlation removal produces stable, ground-truth-aligned rankings on a set of SAEs whose true feature quality is independently verified.

Figures

Figures reproduced from arXiv: 2605.18229 by David Chanin.

**Figure 2.** Trajectories of the same ten headline metrics as Figure 1, on the four-SAE Matryoshka set…

**Figure 3.** SynthSAEBench-16k contains a 16,384-feature ground-truth dictionary broken into hi…

**Figure 4.** Benchmark score vs GT-MCC at canonical hyperparameters, each point is one SAE. Sparse…

**Figure 5.** GT-F1 (left) and GT-MCC (right) vs measured L0 for each SAE architecture in the v1 panel.…

**Figure 6.** Spearman ρ between benchmark score and GT-MCC across the v1 synthetic panel as a function of the benchmark’s primary hyperparameter (top-k for sparse probing, top-n for TPP and SCR), single seed (1234). Canonical-hparam points (top-k = 16 for sparse probing, top-n ∈ {10, 50, 500} for SCR, top-n = 10 for TPP) use the full 35-trained-SAE panel; non-canonical points use the 15-trained-SAE sub-panel from the o…

**Figure 7.** Sparse probing: GT-MCC vs benchmark score across all task categories (rows) and top-…

**Figure 8.** TPP: GT-MCC vs benchmark score across all-in-sae and all-out-of-sae sibling groups…

**Figure 9.** SCR (T=in, S=in): GT-MCC vs benchmark score across top-…

**Figure 10.** Signal-to-noise ratio of every SAEBench metric on the four-SAE architecture snapshot…

**Figure 11.** Pairwise Spearman ρ between SAEBench metric rankings on the four-SAE architectures panel (left) and the four-SAE sampled-Matryoshka panel (right; n ∈ {1, 2, 3, 4}). Each cell is the rank correlation between two metrics’ final-snapshot rankings of the four SAEs in that panel, oriented so positive ρ means agreement on which SAE is better. Subtitle reports the mean off-diagonal ρ across all SNR-informative m…

**Figure 12.** Per-metric variance decomposition on the four-variant Matryoshka panel (3 seeds…

**Figure 13.** Trajectories of every SAEBench metric across the four-SAE architecture training…

**Figure 14.** Trajectories of every SAEBench metric across the four-SAE sampled-Matryoshka training…
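Several of the captions above report a Spearman ρ, either between a benchmark score and GT-MCC or between two metrics' rankings of the same SAEs. As a reminder of what that statistic measures, here is a minimal standard-library sketch; the helper names are hypothetical and this is not the paper's code. Spearman ρ is the Pearson correlation of the ranks, with tied values given their average rank.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

With only four SAEs per panel, as in Figure 11, ρ can take only a handful of distinct values, so individual cells are coarse summaries.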

Original abstract

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TPP and SCR fail reseed and synthetic tests in the SAEBench audit, but the synthetic lens may not map cleanly to real SAE statistics.

Two of the main metrics in SAEBench, Targeted Probe Perturbation and Spurious Correlation Removal, fail the reseed noise test on fixed SAEs and show weak ground-truth correlation on synthetic SAEs at their usual settings. That is the central empirical result, and it lines up with the claim that they should not be used to evaluate SAEs. The sae-probes variant of k-sparse probing comes out as the most stable of the ones examined, though it still has trouble separating close variants of the same architecture.

Referee Report

2 major / 2 minor

Summary. The manuscript audits the reliability of SAE quality metrics in the SAEBench suite using three lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. It concludes that Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR) fail multiple lenses at canonical settings and should not be used to evaluate SAEs; the remaining metrics exhibit higher reseed noise and lower discriminability than typically assumed; and the sae-probes variant of k-sparse probing is the most reliable metric tested, although it still struggles to separate variants of the same SAE architecture.

Significance. If the central findings hold, the work is significant for mechanistic interpretability research because it supplies concrete empirical evidence that two widely adopted SAE metrics are unreliable, which could prevent misallocation of effort toward architectures that only appear better under flawed benchmarks. The multi-lens auditing strategy (reseed, synthetic ground-truth, and trajectory discriminability) offers a reusable template for validating future benchmarks, and the direct empirical tests against external criteria rather than self-referential fits strengthen the falsifiability of the claims.

major comments (2)

[§4] §4 (Ground-truth correlation lens): The synthetic SAE generator is asserted to reproduce the correlation structures, spurious feature overlaps, and probe behaviors of real LLM activations, yet the manuscript provides no quantitative validation (e.g., comparison of activation sparsity histograms, pairwise correlation matrices, or probe accuracy distributions) against residuals from models such as Pythia or Llama. Without this check, failures of TPP and SCR on synthetic data do not necessarily imply the same failures will occur when the metrics are applied to real SAEs.
[§5.2] §5.2 (Discriminability across training trajectories): The claim that sae-probes is the most reliable metric rests on its lower reseed noise and better separation of training trajectories, but the reported results lack effect sizes, confidence intervals, or statistical tests comparing it to the other metrics; this makes it difficult to assess whether the observed advantage is robust or merely descriptive.

minor comments (2)

[Abstract] The abstract states that 'the other metrics show higher reseed noise' without naming the metrics or quantifying the increase relative to sae-probes; adding a short table or sentence with the relevant numbers would improve clarity.
Figure captions for the reseed-noise and discriminability plots do not indicate whether error bars represent standard deviation across seeds or across SAE variants; this notation should be standardized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript to strengthen the presentation of our auditing methodology and results.

Referee: [§4] §4 (Ground-truth correlation lens): The synthetic SAE generator is asserted to reproduce the correlation structures, spurious feature overlaps, and probe behaviors of real LLM activations, yet the manuscript provides no quantitative validation (e.g., comparison of activation sparsity histograms, pairwise correlation matrices, or probe accuracy distributions) against residuals from models such as Pythia or Llama. Without this check, failures of TPP and SCR on synthetic data do not necessarily imply the same failures will occur when the metrics are applied to real SAEs.

Authors: We agree that direct quantitative validation of the synthetic generator against real LLM residuals would improve the strength of the ground-truth lens. The generator was constructed to match documented statistical properties of LLM activations (sparsity, feature correlations, and probe behaviors) drawn from prior work, but we did not include side-by-side empirical comparisons in the submitted manuscript. In revision we will add these comparisons—sparsity histograms, pairwise correlation matrices, and probe accuracy distributions—using residuals from Pythia and Llama models. This will clarify the degree of fidelity and better justify applying the synthetic results to real SAE evaluation. revision: yes
Referee: [§5.2] §5.2 (Discriminability across training trajectories): The claim that sae-probes is the most reliable metric rests on its lower reseed noise and better separation of training trajectories, but the reported results lack effect sizes, confidence intervals, or statistical tests comparing it to the other metrics; this makes it difficult to assess whether the observed advantage is robust or merely descriptive.

Authors: We accept that the discriminability results would benefit from formal statistical support. The original figures were intended to illustrate qualitative trends across trajectories, but we will augment the revised manuscript with effect sizes (Cohen’s d), bootstrap confidence intervals on reseed noise, and statistical comparisons (paired t-tests or Wilcoxon signed-rank tests) between metrics. These additions will allow readers to evaluate the robustness of sae-probes’ relative advantage with greater precision. revision: yes
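The statistics named in this response can be sketched with the standard library. The block below is an illustration under assumptions, not the authors' planned analysis: the function names are hypothetical, Cohen's d uses the pooled sample standard deviation, and the confidence interval is a simple percentile bootstrap.

```python
import random
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d between two score samples, using the pooled sample std."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        estimates.append(stat(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up reseed scores for one metric on one SAE, for illustration only.
reseed_scores = [0.80, 0.82, 0.81, 0.79, 0.83]
noise_ci = bootstrap_ci(reseed_scores, stdev)
```

With only a handful of reseeds the bootstrap interval is crude, since resamples can only recombine the few observed values.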

Circularity Check

0 steps flagged

Empirical audit uses external benchmarks with no reduction to inputs by construction

The paper audits SAE metrics via three independent empirical lenses—reseed noise on fixed SAEs, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories—none of which are defined in terms of the metrics under test or derived from fitted parameters presented as predictions. No equations, self-definitional steps, or load-bearing self-citations appear in the provided text; claims rest on direct experimental outcomes against external criteria rather than renaming or smuggling prior results. This is a standard self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the three chosen lenses accurately diagnose metric reliability for SAE evaluation in practice.

axioms (1)

domain assumption: Reseed noise, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories are valid and sufficient criteria for assessing whether a metric reliably evaluates SAEs.
This premise is invoked to justify the audit approach and the recommendation against using certain metrics.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We audit the SAE quality metrics in SAEBench... through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "TPP scores worse the more an SAE is trained, and SCR becomes negatively correlated with ground-truth at large top-N."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

[2] Samuel R. Bowman and George Dahl. What will it take to fix benchmarking in natural language understanding? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4843–4855, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naa...

[3] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023.

[4] Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. arXiv preprint arXiv:2412.06410, 2024.

[5] Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders. arXiv preprint arXiv:2503.17547, 2025.

[6] Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

[7] David Chanin and Adrià Garriga-Alonso. Synthsaebench: Evaluating sparse autoencoders on scalable realistic synthetic data. arXiv preprint arXiv:2602.14687, 2026.

[8] David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. Advances in Neural Information Processing Systems, 38:82318–82355, 2026.

[9] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations, volume 2024, pages 7827–7845, 2024.

[10] Mostafa Dehghani, Yi Tay, Alexey A. Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. The benchmark lottery. arXiv preprint arXiv:2107.07002, 2021.

[11] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

[12] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

[13] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=JYs1R9IMJr

[14] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. doi: 10.1609/aaai.v32i1.11694. URL https://ojs.aaai.org/index.php/AAAI/article/view/11694

[15] Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. Ravel: Evaluating interpretability methods on disentangling language model representations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8669–8687, 2024.

[16] Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. arXiv preprint arXiv:2502.16681, 2025.

[17] Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural ...

[18] Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025. URL https://arxiv.org/abs/2503.09532

[19] Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, and Elena Tutubalina. Sanity checks for sparse autoencoders: Do saes beat random baselines? arXiv preprint arXiv:2602.14111, 2026.

[20] Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=mPducS1MsEK

[21] Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, August 2024.

[22] Aleksandar Makelov, Georg Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control. arXiv preprint arXiv:2405.08366, 2024.

[23] Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=I4e82CIDxv

[24] Callum McDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan, and Neel Nanda. Gemma scope 2: Technical paper. Technical report, Google DeepMind, 2025. URL https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/Gemma_Scop...

[25] Charles O’Neill, Alim Gumran, and David Klindt. Compute optimal inference and provable amortisation gap in sparse autoencoders. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=8forr1FkvC

[26] Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features. arXiv preprint arXiv:2501.16615, 2025.

[27] Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024.

[28] Nils Reimers and Iryna Gurevych. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

[29] Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. Negative results for sparse autoencoders on downstream tasks and deprioritising SAE research (mechanistic interpretability team progress update). Google DeepMind Safety Research, Medium, 2025. URL https://deepmindsafetyresearch.m...

[30] Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, and Kun Zhang. Position: Mechanistic interpretability should prioritize feature consistency in SAEs. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025. URL https://openreview.net/forum?id=d9ACURK6bI

[31] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, ...

[32] Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025.

A Limitations

The synthetic-SAE correlation analysis (§5) on SynthSAEBench-16k uses three task-generation see...
[33] how good is this SAE at a given prefix width w?

follows a t-distribution with $n - 1 = 4$ degrees of freedom; the 95% two-tailed threshold is therefore $|\Delta|^* = t_{0.025,4} \cdot s\sqrt{2} \approx 3.93s$. The equivalent multiplier under known $\sigma$ would be $1.96\sqrt{2} \approx 2.77$; the inflation factor $t_{0.025,4}/z_{0.025} \approx 1.41$ reflects the chi-squared uncertainty in $s$ given only 5 reseeds (the 95% CI on $\sigma$ from $n = 5$ samples is roughly $0.6s$ to 2.9...
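The multipliers quoted in the passage above can be checked numerically. The sketch below is a sanity check under stated assumptions, not the paper's code: it assumes the threshold applies to a difference of two scores that each carry the same reseed noise s (which would explain the factor of √2), and all names are hypothetical. It uses the closed-form CDF of Student's t for exactly 4 degrees of freedom and inverts it by bisection, so only the standard library is needed.

```python
import math
from statistics import NormalDist


def t_cdf_df4(t: float) -> float:
    """Closed-form CDF of Student's t with exactly 4 degrees of freedom."""
    u = t / math.sqrt(1.0 + t * t / 4.0)
    return 0.5 + 0.375 * u * (1.0 - u * u / 12.0)


def t_quantile_df4(p: float) -> float:
    """Invert t_cdf_df4 by bisection (intended for 0.5 <= p < 1)."""
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if t_cdf_df4(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


t_crit = t_quantile_df4(0.975)          # t_{0.025,4}
z_crit = NormalDist().inv_cdf(0.975)    # z_{0.025}

t_multiplier = t_crit * math.sqrt(2.0)  # threshold on |Δ| in units of s
z_multiplier = z_crit * math.sqrt(2.0)  # same threshold if σ were known
inflation = t_crit / z_crit             # cost of estimating σ from 5 reseeds
```

The three computed values agree with the quoted multipliers to within rounding.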

[1] [1]

Language models can explain neurons in language models

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. 2023, 2023. URL https://openaipublic.blob.core.windows.net/n euron-explainer/paper/index.html

work page 2023

[2] [2]

Bowman and George Dahl

Samuel R. Bowman and George Dahl. What will it take to fix benchmarking in natural language understanding? InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4843–4855, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naa...

work page doi:10.18653/v1/2021.naacl-m 2021

[3] [3]

Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2, 2023

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2, 2023

work page 2023

[4] [4]

Batchtopk sparse autoencoders.arXiv preprint arXiv:2412.06410, 2024

Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders.arXiv preprint arXiv:2412.06410, 2024

work page arXiv 2024

[5] [5]

Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S

Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders.arXiv preprint arXiv:2503.17547, 2025

work page arXiv 2025

[6] [6]

With little power comes great responsibility

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. With little power comes great responsibility. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

work page 2020

[7] [7]

Synthsaebench: Evaluating sparse autoencoders on scalable realistic synthetic data.arXiv preprint arXiv:2602.14687, 2026

David Chanin and Adrià Garriga-Alonso. Synthsaebench: Evaluating sparse autoencoders on scalable realistic synthetic data.arXiv preprint arXiv:2602.14687, 2026

work page arXiv 2026

[8] [8]

A is for absorption: Studying feature splitting and absorption in sparse autoen- coders.Advances in Neural Information Processing Systems, 38:82318–82355, 2026

David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoen- coders.Advances in Neural Information Processing Systems, 38:82318–82355, 2026

work page 2026

[9] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations, volume 2024, pages 7827–7845, 2024

[10] Mostafa Dehghani, Yi Tay, Alexey A. Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. The benchmark lottery. arXiv preprint arXiv:2107.07002, 2021

[11] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

[12] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

[13] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=JYs1R9IMJr

[14] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. doi: 10.1609/aaai.v32i1.11694. URL https://ojs.aaai.org/index.php/AAAI/article/view/11694

[15] Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. Ravel: Evaluating interpretability methods on disentangling language model representations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8669–8687, 2024

[16] Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. arXiv preprint arXiv:2502.16681, 2025

[17] Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural ... doi: 10.52202/079017-2644, 2024

[18] Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025. URL https://arxiv.org/abs/2503.09532

[19] Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, and Elena Tutubalina. Sanity checks for sparse autoencoders: Do saes beat random baselines? arXiv preprint arXiv:2602.14111, 2026

[20] Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=mPducS1MsEK

[21] Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, August 2024

[22] Aleksandar Makelov, Georg Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control. arXiv preprint arXiv:2405.08366, 2024

[23] Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=I4e82CIDxv

[24] Callum McDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan, and Neel Nanda. Gemma scope 2: Technical paper. Technical report, Google DeepMind, 2025. URL https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/Gemma_Scop...

[25] Charles O’Neill, Alim Gumran, and David Klindt. Compute optimal inference and provable amortisation gap in sparse autoencoders. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=8forr1FkvC

[26] Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features. arXiv preprint arXiv:2501.16615, 2025

[27] Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024

[28] Nils Reimers and Iryna Gurevych. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017

[29] Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. Negative results for sparse autoencoders on downstream tasks and deprioritising SAE research (mechanistic interpretability team progress update). Google DeepMind Safety Research, Medium, 2025. URL https://deepmindsafetyresearch.m...

[30] Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, and Kun Zhang. Position: Mechanistic interpretability should prioritize feature consistency in SAEs. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025. URL https://openreview.net/forum?id=d9ACURK6bI

[31] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, ...

[32] Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025

A Limitations

The synthetic-SAE correlation analysis (§5) on SynthSAEBench-16k uses three task-generation see...


how good is this SAE at a given prefix width w?

follows a t-distribution with $n-1 = 4$ degrees of freedom; the 95% two-tailed threshold is therefore $|\Delta|^* = t_{0.025,4} \cdot s\sqrt{2} \approx 3.93s$. The equivalent multiplier under known $\sigma$ would be $1.96\sqrt{2} \approx 2.77$; the inflation factor $t_{0.025,4}/z_{0.025} \approx 1.41$ reflects the chi-squared uncertainty in $s$ given only 5 reseeds (the 95% CI on $\sigma$ from $n = 5$ samples is roughly $0.6s$ to 2.9...
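The threshold arithmetic above can be checked with a short stdlib-only sketch. It is an illustration, not the paper's code: the function name `reseed_threshold`, the hard-coded table value `T_CRIT_DF4`, and the example scores are all hypothetical, and the sketch assumes the quantity being tested is the difference between two single-run scores that share the same reseed standard deviation (which is where the factor of √2 comes from).

```python
import math
from statistics import NormalDist, stdev

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
# Hard-coded from a standard t-table so the sketch needs no third-party library.
T_CRIT_DF4 = 2.776


def reseed_threshold(scores):
    """Return the 95% two-tailed threshold |delta|* = t * s * sqrt(2).

    `scores` are metric values from 5 reseeds of the same SAE (assumption:
    exactly 5, matching the 4 degrees of freedom used in the text).
    """
    if len(scores) != 5:
        raise ValueError("this sketch hard-codes the critical value for n = 5")
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    return T_CRIT_DF4 * s * math.sqrt(2)


# Multipliers on s (unknown sigma) and on sigma (known sigma).
z_crit = NormalDist().inv_cdf(0.975)            # about 1.96
t_multiplier = T_CRIT_DF4 * math.sqrt(2)        # about 3.93
known_sigma_multiplier = z_crit * math.sqrt(2)  # about 2.77
inflation = T_CRIT_DF4 / z_crit                 # t-over-z inflation factor

# Hypothetical reseed scores, purely for illustration.
example_scores = [0.70, 0.72, 0.69, 0.71, 0.73]
threshold = reseed_threshold(example_scores)

print(f"multiplier on s:     {t_multiplier:.2f}")
print(f"multiplier on sigma: {known_sigma_multiplier:.2f}")
print(f"threshold for example scores: {threshold:.4f}")
```

With these made-up scores, two SAEs would need to differ by more than the printed threshold on a single run before the gap could be distinguished from reseed noise at the 95% level.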
