SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability. arXiv preprint arXiv:2503.09532

Karvonen, A · 2025 · arXiv 2503.09532

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

HH-SAE factorizes manifolds into nested contextual (L0), atomic (f1), and compository (f2) tiers, achieving 0.9156 cross-domain zero-shot AUC in fraud detection and +9.9% AUPRC lift in steered synthesis.

Structural Instability of Feature Composition

cs.LG · 2026-04-18 · unverdicted · novelty 7.0

Feature composition in SAEs collapses asymptotically when the Gaussian mean width of the signal cone is exceeded, with ReLU inducing a ratchet-like accumulation of interference from correlations.

Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

cs.AI · 2026-05-07 · conditional · novelty 6.0

Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

cs.LG · 2025-09-11 · unverdicted · novelty 5.0

Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.

citing papers explorer

Showing 7 of 7 citing papers.

HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds cs.LG · 2026-05-11 · unverdicted · none · ref 18
HH-SAE factorizes manifolds into nested contextual (L0), atomic (f1), and compository (f2) tiers, achieving 0.9156 cross-domain zero-shot AUC in fraud detection and +9.9% AUPRC lift in steered synthesis.
Structural Instability of Feature Composition cs.LG · 2026-04-18 · unverdicted · none · ref 4
Feature composition in SAEs collapses asymptotically when the Gaussian mean width of the signal cone is exceeded, with ReLU inducing a ratchet-like accumulation of interference from correlations.
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders cs.LG · 2026-05-08 · unverdicted · none · ref 12 · 2 links
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features cs.AI · 2026-05-07 · conditional · none · ref 10
Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 34
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models cs.CL · 2026-01-20 · unverdicted · none · ref 148
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework cs.LG · 2025-09-11 · unverdicted · none · ref 20
Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.
