Toy Models of Superposition
Canonical reference. 85% of citing Pith papers cite this work as background.
abstract
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
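As an illustration of what "storing features in superposition" can look like, here is a minimal sketch, assuming the toy model takes the ReLU-output form x' = ReLU(WᵀWx + b). The weights below are hand-picked rather than trained, and the function and variable names are hypothetical. Two features share a single hidden dimension as an antipodal pair: each is recovered exactly when it is active alone, and the two interfere when both are active, which is why sparsity matters.

```python
import numpy as np

def toy_model(W, b, x):
    """Compress x into a smaller hidden space, then reconstruct: ReLU(W^T W x + b)."""
    h = W @ x                          # hidden activations, shape (m,)
    return np.maximum(W.T @ h + b, 0)  # reconstruction, shape (n,)

# Assumption for illustration: n = 2 features, m = 1 hidden dimension,
# with the features stored in opposite directions (an antipodal pair).
W = np.array([[1.0, -1.0]])
b = np.zeros(2)

only_first = toy_model(W, b, np.array([1.0, 0.0]))   # feature 1 active alone
only_second = toy_model(W, b, np.array([0.0, 1.0]))  # feature 2 active alone
both = toy_model(W, b, np.array([1.0, 1.0]))         # both active: they cancel

print(only_first, only_second, both)
```

With sparse inputs the "both active" case is rare, so the single hidden dimension can serve two features at a small expected cost; with dense inputs the cancellation would dominate the reconstruction error.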
representative citing papers
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
Introduces PairSAE, a sparse autoencoder for pair representations in structural biology foundation models that produces features aligned with UniProt annotations and affinity predictions.
Sparsity regularizers applied before Top-k selection in SAEs improve monosemanticity and make reconstruction robust to inference-time k across vision models and datasets.
An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.
Evaluation of two latent reasoning models against controls shows observable latent patterns appear without the proposed mechanisms, have graded causal effects on behavior, and concentrate in structured low-rank directions, arguing that patterns are insufficient evidence for reasoning.
VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.
LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.
Introduces coherence as a topological constraint on representations and the Coh objective to enforce geometric clustering for interpretability in neural networks.
Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.
Subliminal learning is a LoRA artifact that disappears with full finetuning, depends on context tokens like system prompts, and localizes to overlapping finetuning-evaluation tokens.
SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.
A Riemannian geodesic framework for label-free manifold steering in language models via a schema-supervised encoder approximating output Hellinger distance on activations.
Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.
Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.
In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity and degrading decodability with added tasks.
citing papers explorer
-
Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders
Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.
-
When Does LeJEPA Learn a World Model?
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
-
Crafting Reversible SFT Behaviors in Large Language Models
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
-
SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
-
PairSAE: Mechanistic Interpretability from Pair Representations in Protein Co-Folding
Introduces PairSAE, a sparse autoencoder for pair representations in structural biology foundation models that produces features aligned with UniProt annotations and affinity predictions.
-
Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders
Sparsity regularizers applied before Top-k selection in SAEs improve monosemanticity and make reconstruction robust to inference-time k across vision models and datasets.
-
Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLMs)
An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.
-
Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models
Evaluation of two latent reasoning models against controls shows observable latent patterns appear without the proposed mechanisms, have graded causal effects on behavior, and concentrate in structured low-rank directions, arguing that patterns are insufficient evidence for reasoning.
-
VFUSE: Virulent Feature Understanding with Sparse autoEncoders
VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.
-
Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control
LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.
-
Learning Coherent Representations: A Topological Approach to Interpretability
Introduces coherence as a topological constraint on representations and the Coh objective to enforce geometric clustering for interpretability in neural networks.
-
Subliminal Learning Is Steering Vector Distillation
Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.
-
Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering
A Riemannian geodesic framework for label-free manifold steering in language models via a schema-supervised encoder approximating output Hellinger distance on activations.
-
Learning Through Noise: Why Subliminal Learning Works and When It Fails
Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.
-
The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning
A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity and degrading decodability with added tasks.
-
SMIXAE: Towards Unsupervised Manifold Discovery in Language Models
SMIXAE is a new mixture-of-autoencoders architecture that learns multidimensional manifolds directly from transformer activations, recovering known structures and identifying novel ones in Gemma 2 2B and 9B models.
-
From Mechanistic to Compositional Interpretability
The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.
-
What Cohort INRs Encode and Where to Freeze Them
The optimal INR freeze depth matches the layer with the highest weight stable rank; SAEs reveal that SIREN atoms are localized while FFMLP atoms trace cohort contours, with causal impact on PSNR.
-
Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval
Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.
-
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
-
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.
-
Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers
CNN classifiers work by holographic superposition and destructive interference in pixel space rather than selecting cleaned features, as proven by a new adjoint inversion framework that also yields a covariance-volume channel selection algorithm.
-
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
-
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.
-
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.
-
Isotropic Activation Functions Enable Deindividuated Neurons and Adaptive Topologies
Isotropic activation functions derived from reparameterisation symmetries and SVD diagonalisation enable function-preserving neuron removal and addition in dense networks, supporting up to 50% sparsification and real-time topology adaptation.
-
Mechanistic Interpretability with Sparse Autoencoder Neural Operators
SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
Improving Dictionary Learning with Gated Sparse Autoencoders
Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.
-
HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
HARC couples harmfulness and refusal directions across prompt and response positions via subspace fine-tuning, achieving a better robustness-capability-usability trade-off than six baselines while transferring across model families.
-
Radical AI Interpretability
A framework is proposed for solving for an AI system's beliefs and desires from its computational facts, with criteria for success tied to interpretability tests and emphasis on holistic attribution.
-
Steering Vision-Language Models with Joint Sparse Autoencoders
JSAE jointly factorizes pooled vision and language activations in VLMs into aligned interpretable features, revealing layer-dependent asymmetry in additive steering versus suppression on three models.
-
Concept Removal for Frontier Image Generative Models
A transcoder-based in-place replacement of the bottleneck layer enables selective concept removal in modern diffusion and autoregressive image models without degrading output quality.
-
Evidence for feature-specific error correction in LLMs
Perturbation experiments across six LLMs show activation robustness follows L^p norm with p>2 for feature directions (contrastive, MELBO, SAE) but p≈2 for random/PCA controls, indicating feature-specific error correction.
-
Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders
A new SAE-based framework extracts visual, textual, and multimodal concepts from VLMs and reports up to 45% better visual concept quality on a VQA dataset while identifying multimodal concepts.
-
Critical Percolation as a Synthetic Data Model for Interpretability
Critical percolation clusters embedded in high dimensions, combined with taxonomic latent variables, form an analytically tractable synthetic data model whose ground-truth hierarchy can be linearly decoded from network activations.
-
Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds
Compositionality emerges in neural networks only in a narrow depth-connectivity regime, with gradient descent converging to fractured solutions outside it.
-
From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability
Derives an upper bound on frozen LM expected risk from proxy risk, SAE reconstruction gap, concept-pool mismatch and sparse complexity, with non-vacuous bounds observed on GPT-2, Gemma-2B and Llama-3-8B.
-
When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations
Proposes using sparse autoencoders to extract class-conditioned concept vectors, then measuring logit stability under targeted perturbations as an interpretable OOD signal for deep networks in medical imaging.
-
Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs
Adversarial fine-tuning evades activation-based steganography detection in five LLMs while preserving secret recovery, but a recontextualization dataset restores both ridge and MLP probe detectability.
-
Interactions Between Crosscoder Features: A Compact Proofs Perspective
Derives an interaction measure between crosscoder features from reconstruction error in compact proofs and applies it to produce computationally sparse crosscoders retaining 60% MLP performance with single-feature selection versus 10% for standard crosscoders.
-
Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects
Pre-intervention feature statistics predict SAE steering modularity (stability and collateral spread) better than baselines across multiple models and dictionaries, with model-dependent success in held-out selection.
-
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
Sparse autoencoders identify shared latent features across diverse backdoor attacks in LLMs that enable unified detection via classifiers, causal control via steering, and mitigation via ablation fine-tuning.
-
TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment
TEVI applies sparse autoencoders and caption-conditioned masking to edit image embeddings, yielding better retrieval on MS COCO, Flickr, IIW, DOCCI, and RoCOCO benchmarks with larger gains on richer captions.