Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.
Projecting assumptions: The duality between sparse autoencoders and concept geometry
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
background 1polarities
background 1representative citing papers
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.
SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.
GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.
citing papers explorer
-
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions
Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.
-
From Mechanistic to Compositional Interpretability
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
-
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
-
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.
-
Mechanistic Interpretability with Sparse Autoencoder Neural Operators
SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.
-
GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models
GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.