SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, et al · 2025 · arXiv 2503.09532

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

A Unifying Framework for Concept-Based Representational Similarity

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

A unifying framework decomposes concept alignment into instance-wise and distributional translation and concept consistency, introduces the InterVenchA benchmark, and shows that joint optimization via CoSAE recovers strong alignment even with 0.1% paired data.

Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations

cs.LG · 2026-05-27 · conditional · novelty 7.0

SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.

HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

HH-SAE factorizes manifolds into nested contextual (L0), atomic (f1), and compository (f2) tiers, achieving 0.9156 cross-domain zero-shot AUC in fraud detection and +9.9% AUPRC lift in steered synthesis.

Structural Instability of Feature Composition

cs.LG · 2026-04-18 · unverdicted · novelty 7.0

Feature composition in SAEs collapses asymptotically when the Gaussian mean width of the signal cone is exceeded, with ReLU inducing a ratchet-like accumulation of interference from correlations.

Do Sparse Autoencoders Learn Meaningful Concept Hierarchies?

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Sparse autoencoders provide a basis for sensible concept hierarchies on visual data but are undermined by hard and soft feature absorption.

Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

cs.AI · 2026-05-07 · conditional · novelty 6.0

Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

cs.LG · 2025-09-11 · unverdicted · novelty 5.0

Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.

citing papers explorer

Showing 8 of 8 citing papers after filters.

A Unifying Framework for Concept-Based Representational Similarity
cs.LG · 2026-06-08 · unverdicted · none · ref 18

HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds
cs.LG · 2026-05-11 · unverdicted · none · ref 18

Structural Instability of Feature Composition
cs.LG · 2026-04-18 · unverdicted · none · ref 4

Do Sparse Autoencoders Learn Meaningful Concept Hierarchies?
cs.LG · 2026-06-22 · unverdicted · none · ref 20

Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
cs.LG · 2026-05-08 · unverdicted · none · ref 12 · 2 links

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
cs.LG · 2026-04-10 · unverdicted · none · ref 34

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
cs.CL · 2026-01-20 · unverdicted · none · ref 148

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
cs.LG · 2025-09-11 · unverdicted · none · ref 20
