A unifying framework decomposes concept alignment into instance-wise and distributional translation and concept consistency, introduces the InterVenchA benchmark, and shows that joint optimization via CoSAE recovers strong alignment even with 0.1% paired data.
Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.
HH-SAE factorizes manifolds into nested contextual (L0), atomic (f1), and compository (f2) tiers, achieving 0.9156 cross-domain zero-shot AUC in fraud detection and +9.9% AUPRC lift in steered synthesis.
Feature composition in SAEs collapses asymptotically when the Gaussian mean width of the signal cone is exceeded, with ReLU inducing a ratchet-like accumulation of interference from correlations.
Sparse autoencoders provide a basis for sensible concept hierarchies on visual data but are undermined by hard and soft feature absorption.
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.
citing papers explorer
-
A Unifying Framework for Concept-Based Representational Similarity
A unifying framework decomposes concept alignment into instance-wise and distributional translation and concept consistency, introduces the InterVenchA benchmark, and shows that joint optimization via CoSAE recovers strong alignment even with 0.1% paired data.
-
HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds
HH-SAE factorizes manifolds into nested contextual (L0), atomic (f1), and compository (f2) tiers, achieving 0.9156 cross-domain zero-shot AUC in fraud detection and +9.9% AUPRC lift in steered synthesis.
-
Structural Instability of Feature Composition
Feature composition in SAEs collapses asymptotically when the Gaussian mean width of the signal cone is exceeded, with ReLU inducing a ratchet-like accumulation of interference from correlations.
-
Do Sparse Autoencoders Learn Meaningful Concept Hierarchies?
Sparse autoencoders provide a basis for sensible concept hierarchies on visual data but are undermined by hard and soft feature absorption.
-
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
-
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
-
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.