Improving Robustness In Sparse Autoencoders via Masked Regularization

· 2026 · cs.LG · arXiv 2604.06495

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often result in brittle latent representations. SAEs are known to be prone to feature absorption, where general features are subsumed by more specific ones due to co-occurrence, degrading interpretability despite high reconstruction fidelity. Recent negative results on Out-of-Distribution (OOD) performance further underscore broader robustness related failures tied to under-specified training objectives. We address this by proposing a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. This improves robustness across SAE architectures and sparsity levels reducing absorption, enhancing probing performance, and narrowing the OOD gap. Our results point toward a practical path for more reliable interpretability tools.

representative citing papers

Improving Robustness In Sparse Autoencoders via Masked Regularization

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.

citing papers explorer

Showing 1 of 1 citing paper.

Improving Robustness In Sparse Autoencoders via Masked Regularization cs.LG · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.

Improving Robustness In Sparse Autoencoders via Masked Regularization

fields

years

verdicts

representative citing papers

citing papers explorer