arxiv: 2605.18229 · v1 · pith:GTZ44TTJnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Are Sparse Autoencoder Benchmarks Reliable?

Pith reviewed 2026-05-20 12:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: sparse autoencoders · SAE evaluation · interpretability metrics · benchmark reliability · language model features · metric auditing

The pith

Two common metrics for measuring sparse autoencoder quality produce inconsistent results and should be dropped from evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits the standard benchmarks used to compare sparse autoencoders for language model interpretability. It applies three tests to the metrics: how much scores change when the same autoencoder is re-evaluated with a new random seed, how well scores track known ground-truth features in synthetic data, and how well scores distinguish autoencoders along their training trajectories. Two metrics, targeted probe perturbation and spurious correlation removal, fail multiple tests at their usual settings. The remaining metrics show more noise and weaker separation power than commonly assumed, though the sae-probes variant of k-sparse probing performs better than the rest.

Core claim

The paper audits the SAEBench metrics through three lenses: reseed noise on fixed SAEs, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. Targeted probe perturbation and spurious correlation removal fail at canonical settings and should not be used. The other metrics exhibit higher reseed noise and lower discriminability than assumed, and the sae-probes variant of k-sparse probing emerges as the most reliable option tested.

What carries the argument

Three auditing lenses (reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories) applied to the SAEBench metrics.
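As a rough illustration of what each lens measures, the sketch below computes one plausible statistic per lens on made-up scores. The function names, the SAE labels, the numbers, and the particular signal-to-noise form used for discriminability are all assumptions made for illustration; the paper's own definitions may differ.

```python
import math
import statistics

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores of one metric: three SAEs, three reseeds each.
scores = {
    "sae_a": [0.70, 0.72, 0.71],
    "sae_b": [0.80, 0.79, 0.81],
    "sae_c": [0.60, 0.62, 0.61],
}
# Hypothetical ground-truth quality for the same SAEs (synthetic setting).
ground_truth = {"sae_a": 0.5, "sae_b": 0.9, "sae_c": 0.2}

# Lens 1: reseed noise, the sample std of each SAE's score across reseeds.
noise = {name: statistics.stdev(vals) for name, vals in scores.items()}

# Lens 2: rank agreement between the metric and the ground truth.
means = [statistics.fmean(scores[name]) for name in scores]
rho = spearman(means, [ground_truth[name] for name in scores])

# Lens 3 (assumed form): spread of per-SAE means relative to mean reseed noise.
snr = statistics.stdev(means) / statistics.fmean(noise.values())

print(noise, rho, snr)
```

A metric that scores well here has small per-SAE noise, a rank correlation near 1 with the ground truth, and a between-SAE spread that is large relative to its noise.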

If this is right

  • Evaluations that relied on targeted probe perturbation or spurious correlation removal may have mis-ranked sparse autoencoders.
  • The sae-probes variant should be preferred over the other metrics when comparing current sparse autoencoder designs.
  • Progress on sparse autoencoder architectures will require new evaluation methods that separate models more cleanly.
  • Existing published comparisons of sparse autoencoders should be re-examined for metric-induced errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Without reliable metrics, claims about which sparse autoencoder architecture best captures model features rest on shaky ground.
  • The field could test whether combining the surviving metrics with new synthetic datasets improves separation between similar architectures.
  • If the same noise patterns appear in other interpretability tools, similar auditing procedures might reveal hidden weaknesses there as well.

Load-bearing premise

The three chosen lenses are enough to decide whether a metric can be trusted when evaluating real sparse autoencoders.

What would settle it

A controlled experiment in which targeted probe perturbation or spurious correlation removal produces stable, ground-truth-aligned rankings on a set of SAEs whose true feature quality is independently verified.

Figures

Figures reproduced from arXiv: 2605.18229 by David Chanin.

Figure 1: Trajectories of ten representative metrics across the 1.5B-token training run in the cross…

Figure 2: Trajectories of the same ten headline metrics as Figure 1, on the four-SAE Matryoshka set…

Figure 3: SynthSAEBench-16k contains a 16,384-feature ground-truth dictionary broken into hi…

Figure 4: Benchmark score vs GT-MCC at canonical hyperparameters, each point is one SAE. Sparse…

Figure 5: GT-F1 (left) and GT-MCC (right) vs measured L0 for each SAE architecture in the v1 panel.

Figure 6: Spearman ρ between benchmark score and GT-MCC across the v1 synthetic panel as a function of the benchmark’s primary hyperparameter (top-k for sparse probing, top-n for TPP and SCR), single seed (1234). Canonical-hparam points (top-k = 16 for sparse probing, top-n ∈ {10, 50, 500} for SCR, top-n = 10 for TPP) use the full 35-trained-SAE panel; non-canonical points use the 15-trained-SAE sub-panel from the o…

Figure 7: Sparse probing: GT-MCC vs benchmark score across all task categories (rows) and top-…

Figure 8: TPP: GT-MCC vs benchmark score across all-in-sae and all-out-of-sae sibling groups…

Figure 9: SCR (T=in, S=in): GT-MCC vs benchmark score across top-…

Figure 10: Signal-to-noise ratio of every SAEBench metric on the four-SAE architecture snapshot…

Figure 11: Pairwise Spearman ρ between SAEBench metric rankings on the four-SAE architectures panel (left) and the four-SAE sampled-Matryoshka panel (right; n ∈ {1, 2, 3, 4}). Each cell is the rank correlation between two metrics’ final-snapshot rankings of the four SAEs in that panel, oriented so positive ρ means agreement on which SAE is better. Subtitle reports the mean off-diagonal ρ across all SNR-informative m…

Figure 12: Per-metric variance decomposition on the four-variant Matryoshka panel (3 seeds…

Figure 13: Trajectories of every SAEBench metric across the four-SAE architecture training…

Figure 14: Trajectories of every SAEBench metric across the four-SAE sampled-Matryoshka training…
Original abstract

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript audits the reliability of SAE quality metrics in the SAEBench suite using three lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. It concludes that Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR) fail multiple lenses at canonical settings and should not be used to evaluate SAEs; the remaining metrics exhibit higher reseed noise and lower discriminability than typically assumed; and the sae-probes variant of k-sparse probing is the most reliable metric tested, although it still struggles to separate variants of the same SAE architecture.

Significance. If the central findings hold, the work is significant for mechanistic interpretability research because it supplies concrete empirical evidence that two widely adopted SAE metrics are unreliable, which could prevent misallocation of effort toward architectures that only appear better under flawed benchmarks. The multi-lens auditing strategy (reseed, synthetic ground-truth, and trajectory discriminability) offers a reusable template for validating future benchmarks, and the direct empirical tests against external criteria rather than self-referential fits strengthen the falsifiability of the claims.

major comments (2)
  1. [§4] (Ground-truth correlation lens): The synthetic SAE generator is asserted to reproduce the correlation structures, spurious feature overlaps, and probe behaviors of real LLM activations, yet the manuscript provides no quantitative validation (e.g., comparison of activation sparsity histograms, pairwise correlation matrices, or probe accuracy distributions) against residuals from models such as Pythia or Llama. Without this check, failures of TPP and SCR on synthetic data do not necessarily imply the same failures will occur when the metrics are applied to real SAEs.
  2. [§5.2] (Discriminability across training trajectories): The claim that sae-probes is the most reliable metric rests on its lower reseed noise and better separation of training trajectories, but the reported results lack effect sizes, confidence intervals, or statistical tests comparing it to the other metrics; this makes it difficult to assess whether the observed advantage is robust or merely descriptive.
minor comments (2)
  1. [Abstract] The abstract states that 'the other metrics show higher reseed noise' without naming the metrics or quantifying the increase relative to sae-probes; adding a short table or sentence with the relevant numbers would improve clarity.
  2. Figure captions for the reseed-noise and discriminability plots do not indicate whether error bars represent standard deviation across seeds or across SAE variants; this notation should be standardized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript to strengthen the presentation of our auditing methodology and results.

point-by-point responses
  1. Referee: [§4] (Ground-truth correlation lens): The synthetic SAE generator is asserted to reproduce the correlation structures, spurious feature overlaps, and probe behaviors of real LLM activations, yet the manuscript provides no quantitative validation (e.g., comparison of activation sparsity histograms, pairwise correlation matrices, or probe accuracy distributions) against residuals from models such as Pythia or Llama. Without this check, failures of TPP and SCR on synthetic data do not necessarily imply the same failures will occur when the metrics are applied to real SAEs.

    Authors: We agree that direct quantitative validation of the synthetic generator against real LLM residuals would improve the strength of the ground-truth lens. The generator was constructed to match documented statistical properties of LLM activations (sparsity, feature correlations, and probe behaviors) drawn from prior work, but we did not include side-by-side empirical comparisons in the submitted manuscript. In revision we will add these comparisons (sparsity histograms, pairwise correlation matrices, and probe accuracy distributions) using residuals from Pythia and Llama models. This will clarify the degree of fidelity and better justify applying the synthetic results to real SAE evaluation.

    Revision: yes

  2. Referee: [§5.2] (Discriminability across training trajectories): The claim that sae-probes is the most reliable metric rests on its lower reseed noise and better separation of training trajectories, but the reported results lack effect sizes, confidence intervals, or statistical tests comparing it to the other metrics; this makes it difficult to assess whether the observed advantage is robust or merely descriptive.

    Authors: We accept that the discriminability results would benefit from formal statistical support. The original figures were intended to illustrate qualitative trends across trajectories, but we will augment the revised manuscript with effect sizes (Cohen’s d), bootstrap confidence intervals on reseed noise, and statistical comparisons (paired t-tests or Wilcoxon signed-rank tests) between metrics. These additions will allow readers to evaluate the robustness of sae-probes’ relative advantage with greater precision.

    Revision: yes
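Two of the statistics named in the second response, Cohen's d and a bootstrap confidence interval on reseed noise, can be sketched with the Python standard library alone. The score lists, function names, resample count, and the choice of a percentile bootstrap are hypothetical choices for illustration, not the authors' planned analysis.

```python
import math
import random
import statistics

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

def bootstrap_ci(sample, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)]) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical scores of two SAEs under one metric, five reseeds each.
sae_x = [0.71, 0.73, 0.70, 0.72, 0.74]
sae_y = [0.69, 0.70, 0.68, 0.71, 0.69]

d = cohens_d(sae_x, sae_y)
noise_low, noise_high = bootstrap_ci(sae_x, statistics.stdev)
print(d, noise_low, noise_high)
```

With only five reseeds the bootstrap interval on the standard deviation is wide and biased toward small values, so it is better read as a rough bound than as a precise estimate.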

Circularity Check

0 steps flagged

Empirical audit uses external benchmarks with no reduction to inputs by construction

full rationale

The paper audits SAE metrics via three independent empirical lenses—reseed noise on fixed SAEs, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories—none of which are defined in terms of the metrics under test or derived from fitted parameters presented as predictions. No equations, self-definitional steps, or load-bearing self-citations appear in the provided text; claims rest on direct experimental outcomes against external criteria rather than renaming or smuggling prior results. This is a standard self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the three chosen lenses accurately diagnose metric reliability for SAE evaluation in practice.

axioms (1)
  • domain assumption: Reseed noise, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories are valid and sufficient criteria for assessing whether a metric reliably evaluates SAEs.
    This premise is invoked to justify the audit approach and the recommendation against using certain metrics.


A Limitations

The synthetic-SAE correlation analysis (§5) on SynthSAEBench-16k uses three task-generation see...

how good is this SAE at a given prefix width w?

follows a t-distribution with $n - 1 = 4$ degrees of freedom; the 95% two-tailed threshold is therefore $|\Delta|^* = t_{0.025,4} \cdot s\sqrt{2} \approx 3.93s$. The equivalent multiplier under known $\sigma$ would be $1.96\sqrt{2} \approx 2.77$; the inflation factor $t_{0.025,4}/z_{0.025} \approx 1.41$ reflects the chi-squared uncertainty in $s$ given only 5 reseeds (the 95% CI on $\sigma$ from $n = 5$ samples is roughly $0.6s$ to 2.9...
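The threshold arithmetic in this fragment can be checked with a few lines of Python. The sketch assumes that the $\sqrt{2}$ factor arises because the difference of two independent scores, each with standard deviation $s$, has standard deviation $s\sqrt{2}$. It also hard-codes the tabulated critical value $t_{0.025,4} \approx 2.776$, since the standard library has no t-distribution quantile function. The constant and function names are illustrative.

```python
import math
from statistics import NormalDist

# Two-tailed 95% critical value of Student's t with 4 degrees of freedom,
# taken from a standard table (assumed constant, not computed here).
T_CRIT_DF4 = 2.776

def difference_threshold(s, critical_value):
    """Smallest |delta| between two scores that clears the noise threshold."""
    return critical_value * s * math.sqrt(2)

z_crit = NormalDist().inv_cdf(0.975)  # about 1.96

t_mult = difference_threshold(1.0, T_CRIT_DF4)  # multiplier on s, sigma estimated
z_mult = difference_threshold(1.0, z_crit)      # multiplier on s, sigma known
inflation = T_CRIT_DF4 / z_crit

print(round(t_mult, 2), round(z_mult, 2))  # 3.93 2.77
print(inflation)
```

The two multipliers match the quoted 3.93 and 2.77; the ratio of critical values evaluates to roughly 1.416 under these constants.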