Are Sparse Autoencoder Benchmarks Reliable?
Pith reviewed 2026-05-20 12:39 UTC · model grok-4.3
The pith
Two common metrics for measuring sparse autoencoder quality produce inconsistent results and should be dropped from evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An audit of the SAEBench metrics through three lenses (reseed noise on fixed SAEs, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories) shows that Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR) fail multiple lenses at their canonical settings and should not be used. The other metrics exhibit higher reseed noise and lower discriminability than the field assumes, and the sae-probes variant of k-sparse probing is the most reliable option tested.
What carries the argument
Three auditing lenses (reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories) applied to the SAEBench metrics.
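As a rough illustration of what the three lenses measure, the sketch below computes each one on made-up numbers. The function names, the toy scores, and the pairwise-separation definition of discriminability are assumptions made for illustration; the paper's exact procedures are not given on this page.

```python
import statistics

def reseed_noise(scores):
    """Lens 1: sample standard deviation of a metric over reseeds of one fixed SAE."""
    return statistics.stdev(scores)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(metric_scores, ground_truth):
    """Lens 2: rank correlation between metric scores and known SAE quality."""
    return pearson(average_ranks(metric_scores), average_ranks(ground_truth))

def discriminability(trajectory_scores, noise, multiplier=2.0):
    """Lens 3 (assumed definition): share of checkpoint pairs whose score gap
    exceeds multiplier * noise. Needs at least two checkpoints."""
    pairs = [(a, b) for i, a in enumerate(trajectory_scores)
             for b in trajectory_scores[i + 1:]]
    return sum(abs(a - b) > multiplier * noise for a, b in pairs) / len(pairs)

# Hypothetical data: 5 reseeds, 4 synthetic SAEs, 4 training checkpoints.
reseeds = [0.71, 0.69, 0.72, 0.70, 0.68]
metric, truth = [0.2, 0.5, 0.4, 0.9], [0.1, 0.3, 0.6, 0.8]
trajectory = [0.60, 0.62, 0.70, 0.71]

noise = reseed_noise(reseeds)
print(noise, spearman(metric, truth), discriminability(trajectory, noise))
```

A metric that does well here would show small reseed noise, a rank correlation near 1 with ground truth, and a discriminability near 1 along a training trajectory.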
If this is right
- Evaluations that relied on targeted probe perturbation or spurious correlation removal may have mis-ranked sparse autoencoders.
- The sae-probes variant should be preferred over the other metrics when comparing current sparse autoencoder designs.
- Progress on sparse autoencoder architectures will require new evaluation methods that separate models more cleanly.
- Existing published comparisons of sparse autoencoders should be re-examined for metric-induced errors.
Where Pith is reading between the lines
- Without reliable metrics, claims about which sparse autoencoder architecture best captures model features rest on shaky ground.
- The field could test whether combining the surviving metrics with new synthetic datasets improves separation between similar architectures.
- If the same noise patterns appear in other interpretability tools, similar auditing procedures might reveal hidden weaknesses there as well.
Load-bearing premise
The three chosen lenses are enough to decide whether a metric can be trusted when evaluating real sparse autoencoders.
What would settle it
A controlled experiment in which targeted probe perturbation or spurious correlation removal produces stable, ground-truth-aligned rankings on a set of SAEs whose true feature quality is independently verified.
Original abstract
Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript audits the reliability of SAE quality metrics in the SAEBench suite using three lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. It concludes that Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR) fail multiple lenses at canonical settings and should not be used to evaluate SAEs; the remaining metrics exhibit higher reseed noise and lower discriminability than typically assumed; and the sae-probes variant of k-sparse probing is the most reliable metric tested, although it still struggles to separate variants of the same SAE architecture.
Significance. If the central findings hold, the work is significant for mechanistic interpretability research because it supplies concrete empirical evidence that two widely adopted SAE metrics are unreliable, which could prevent misallocation of effort toward architectures that only appear better under flawed benchmarks. The multi-lens auditing strategy (reseed, synthetic ground-truth, and trajectory discriminability) offers a reusable template for validating future benchmarks, and the direct empirical tests against external criteria rather than self-referential fits strengthen the falsifiability of the claims.
major comments (2)
- [§4] §4 (Ground-truth correlation lens): The synthetic SAE generator is asserted to reproduce the correlation structures, spurious feature overlaps, and probe behaviors of real LLM activations, yet the manuscript provides no quantitative validation (e.g., comparison of activation sparsity histograms, pairwise correlation matrices, or probe accuracy distributions) against residuals from models such as Pythia or Llama. Without this check, failures of TPP and SCR on synthetic data do not necessarily imply the same failures will occur when the metrics are applied to real SAEs.
- [§5.2] §5.2 (Discriminability across training trajectories): The claim that sae-probes is the most reliable metric rests on its lower reseed noise and better separation of training trajectories, but the reported results lack effect sizes, confidence intervals, or statistical tests comparing it to the other metrics; this makes it difficult to assess whether the observed advantage is robust or merely descriptive.
minor comments (2)
- [Abstract] The abstract states that 'the other metrics show higher reseed noise' without naming the metrics or quantifying the increase relative to sae-probes; adding a short table or sentence with the relevant numbers would improve clarity.
- Figure captions for the reseed-noise and discriminability plots do not indicate whether error bars represent standard deviation across seeds or across SAE variants; this notation should be standardized.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript to strengthen the presentation of our auditing methodology and results.
- Referee: [§4] §4 (Ground-truth correlation lens): The synthetic SAE generator is asserted to reproduce the correlation structures, spurious feature overlaps, and probe behaviors of real LLM activations, yet the manuscript provides no quantitative validation (e.g., comparison of activation sparsity histograms, pairwise correlation matrices, or probe accuracy distributions) against residuals from models such as Pythia or Llama. Without this check, failures of TPP and SCR on synthetic data do not necessarily imply the same failures will occur when the metrics are applied to real SAEs.
  Authors: We agree that direct quantitative validation of the synthetic generator against real LLM residuals would improve the strength of the ground-truth lens. The generator was constructed to match documented statistical properties of LLM activations (sparsity, feature correlations, and probe behaviors) drawn from prior work, but we did not include side-by-side empirical comparisons in the submitted manuscript. In revision we will add these comparisons (sparsity histograms, pairwise correlation matrices, and probe accuracy distributions) using residuals from Pythia and Llama models. This will clarify the degree of fidelity and better justify applying the synthetic results to real SAE evaluation.
  Revision: yes
- Referee: [§5.2] §5.2 (Discriminability across training trajectories): The claim that sae-probes is the most reliable metric rests on its lower reseed noise and better separation of training trajectories, but the reported results lack effect sizes, confidence intervals, or statistical tests comparing it to the other metrics; this makes it difficult to assess whether the observed advantage is robust or merely descriptive.
  Authors: We accept that the discriminability results would benefit from formal statistical support. The original figures were intended to illustrate qualitative trends across trajectories, but we will augment the revised manuscript with effect sizes (Cohen’s d), bootstrap confidence intervals on reseed noise, and statistical comparisons (paired t-tests or Wilcoxon signed-rank tests) between metrics. These additions will allow readers to evaluate the robustness of sae-probes’ relative advantage with greater precision.
  Revision: yes
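Two of the statistics named in this response, Cohen's d and a bootstrap confidence interval on reseed noise, can be sketched in a few lines. This is a minimal illustration, assuming per-seed scores are available as plain lists; the function names and sample values are hypothetical, and the paired t-test and Wilcoxon signed-rank test are left out.

```python
import random
import statistics

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5

def bootstrap_ci(samples, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(samples)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    estimates = sorted(
        stat(rng.choices(samples, k=len(samples))) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-seed scores of one metric on two SAE variants.
variant_a = [0.80, 0.82, 0.81, 0.79, 0.83]
variant_b = [0.70, 0.72, 0.71, 0.69, 0.73]

d = cohens_d(variant_a, variant_b)
noise_lo, noise_hi = bootstrap_ci(variant_a, statistics.stdev)
print(f"Cohen's d = {d:.2f}, 95% CI on reseed noise = [{noise_lo:.4f}, {noise_hi:.4f}]")
```

With only five seeds the bootstrap interval is coarse, which is one reason a small-sample correction or more reseeds may be preferable in practice.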
Circularity Check
Empirical audit uses external benchmarks with no reduction to inputs by construction
The paper audits SAE metrics via three independent empirical lenses—reseed noise on fixed SAEs, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories—none of which are defined in terms of the metrics under test or derived from fitted parameters presented as predictions. No equations, self-definitional steps, or load-bearing self-citations appear in the provided text; claims rest on direct experimental outcomes against external criteria rather than renaming or smuggling prior results. This is a standard self-contained empirical evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reseed noise, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories are valid and sufficient criteria for assessing whether a metric reliably evaluates SAEs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We audit the SAE quality metrics in SAEBench... through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: TPP scores worse the more an SAE is trained, and SCR becomes negatively correlated with ground-truth at large top-N.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Limitations
The synthetic-SAE correlation analysis (§5) on SynthSAEBench-16k uses three task-generation see...
how good is this SAE at a given prefix width $w$?
follows a t-distribution with $n - 1 = 4$ degrees of freedom; the 95% two-tailed threshold is therefore $|\Delta|^* = t_{0.025,4} \cdot s\sqrt{2} \approx 3.93s$. The equivalent multiplier under known $\sigma$ would be $1.96\sqrt{2} \approx 2.77$; the inflation factor $t_{0.025,4}/z_{0.025} \approx 1.41$ reflects the chi-squared uncertainty in $s$ given only 5 reseeds (the 95% CI on $\sigma$ from $n = 5$ samples is roughly $0.6s$ to 2.9...
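The threshold arithmetic in this fragment can be checked directly. The sketch below assumes exactly 5 reseeds (4 degrees of freedom), hardcodes the standard two-tailed 95% critical values of Student's t (2.776) and of the normal distribution (1.96), and uses made-up reseed scores; the function name is hypothetical.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-tailed 95% critical value of Student's t, 4 degrees of freedom
Z_CRIT_95 = 1.96       # two-tailed 95% critical value of the standard normal

def difference_threshold(reseed_scores):
    """Smallest |delta| between two scores that exceeds reseed noise at the 95% level.

    Assumes exactly 5 reseeds, so that the hardcoded t critical value applies.
    """
    if len(reseed_scores) != 5:
        raise ValueError("T_CRIT_95_DF4 is only valid for 5 reseeds")
    s = statistics.stdev(reseed_scores)  # sample standard deviation across reseeds
    return T_CRIT_95_DF4 * s * math.sqrt(2)

# Hypothetical metric scores from 5 evaluation reseeds of one fixed SAE.
scores = [0.71, 0.69, 0.72, 0.70, 0.68]
s = statistics.stdev(scores)
threshold = difference_threshold(scores)

print(f"s = {s:.4f}, threshold = {threshold:.4f} ({threshold / s:.2f} s)")
print(f"known-sigma multiplier = {Z_CRIT_95 * math.sqrt(2):.2f}")
```

Under these assumptions, two SAEs whose scores differ by less than the printed threshold cannot be separated from reseed noise alone.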