An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.
Diab, Virginia Smith, and Kun Zhang
2 Pith papers cite this work.
Fields: cs.LG (2)
Years: 2026 (2)
Representative citing papers:
- Are Sparse Autoencoder Benchmarks Reliable?
- Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)