Enhancing neural network interpretability with feature-aligned sparse autoencoders

Luke Marks, Alasdair Paren, David Krueger, Fazl Barez · 2024 · arXiv 2411.01220

5 Pith papers cite this work.

representative citing papers

A Unifying Framework for Concept-Based Representational Similarity

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

A unifying framework decomposes concept alignment into instance-wise and distributional translation and concept consistency, introduces the InterVenchA benchmark, and shows that joint optimization via CoSAE recovers strong alignment even with 0.1% paired data.

Improving Sparse Autoencoder with Dynamic Attention

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.

Perplexity Can Miss SAE Feature Damage Under Quantization

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Quantization of LLMs can degrade many SAE features even when perplexity improves or stays similar, as shown by correlation measurements on frozen SAEs for Pythia-70M and Gemma-2-2B models across INT8 to INT4.

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Aligned training reparameterizes SAEs to enforce unit alignment between encoder and decoder directions, yielding Pareto gains on SAEBench while removing dead features and improving stability.

Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates

cs.CE · 2026-03-28 · unverdicted · novelty 6.0

Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.
