Manifold curvature and intrinsic dimension predict layerwise SAE width exponents and asymptotic floors across Gemma models, with cross-model transfer of the geometric regression, establishing a transferable geometric law instead of a universal scaling law.
Decomposing the dark matter of sparse autoencoders.Transactions on Machine Learning Research
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
background 2polarities
background 2representative citing papers
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.
citing papers explorer
-
The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws
Manifold curvature and intrinsic dimension predict layerwise SAE width exponents and asymptotic floors across Gemma models, with cross-model transfer of the geometric regression, establishing a transferable geometric law instead of a universal scaling law.
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
Geometric Routing Enables Causal Expert Control in Mixture of Experts
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.