URL https://distill

Chris Olah, Alexander Mordvintsev, Ludwig Schubert · 2017 · DOI 10.23915/distill.00007

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

Toy Models of Superposition

cs.LG · 2022-09-21 · accept · novelty 8.0

Toy models demonstrate that polysemanticity arises when neural networks store more sparse features than neurons via superposition, producing a phase transition tied to polytope geometry and increased adversarial vulnerability.

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

Sign patterns in the unrotated standard basis of transformer activations form independent binary feature registers that support training-free detection, prediction, and causal intervention across language, vision, and audio models.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.

Steering Vision-Language Models with Joint Sparse Autoencoders

cs.CV · 2026-06-24 · unverdicted · novelty 6.0

JSAE jointly factorizes pooled vision and language activations in VLMs into aligned interpretable features, revealing layer-dependent asymmetry in additive steering versus suppression on three models.

Feature Visualization Recovers Known Cortical Selectivity from TRIBE v2

q-bio.NC · 2026-05-13 · unverdicted · novelty 6.0

Feature visualization on TRIBE v2 brain encoders recovers the known ventral visual hierarchy from V1 to V4 and produces distinctive patterns for MT, FFA, and PPA, with optimized stimuli driving ~4x higher activation than natural images.

HOLE: Homological Observation of Latent Embeddings for Neural Network Interpretability

cs.LG · 2025-12-08 · unverdicted · novelty 6.0

HOLE applies persistent homology to latent embeddings in neural networks and uses visualizations such as cluster flow diagrams to reveal patterns of class separation, feature disentanglement, and robustness.

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

cs.CL · 2026-06-05 · unverdicted · novelty 5.0

Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.

NeuroViz: Real-time Interactive Visualization of Forward and Backward Passes in Neural Network Training

cs.LG · 2026-05-03 · unverdicted · novelty 5.0

NeuroViz offers interactive real-time visualization of neural network forward and backward passes, achieving top usability scores in a study with 31 participants compared to existing tools.

Open Problems in Mechanistic Interpretability

cs.LG · 2025-01-27 · unverdicted · novelty 3.0

A review paper that organizes conceptual, practical, and socio-technical open problems in mechanistic interpretability.

Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images

cs.LG · 2026-06-30
