hub Canonical reference

https://distill.pub/2020/circuits/zoom-in

Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan · 2020 · DOI 10.23915/distill.00024.001

Canonical reference. 100% of classified citations from Pith papers cite this work as background.

19 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Toy Models of Superposition

cs.LG · 2022-09-21 · accept · novelty 8.0

Toy models demonstrate that polysemanticity arises when neural networks store more sparse features than neurons via superposition, producing a phase transition tied to polytope geometry and increased adversarial vulnerability.

Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.

Data-driven Circuit Discovery for Interpretability of Language Models

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.

Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion model that are clearer than those from entangled baselines.

Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.

Improving Dictionary Learning with Gated Sparse Autoencoders

cs.LG · 2024-04-24 · unverdicted · novelty 7.0

Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.

In-context Learning and Induction Heads

cs.LG · 2022-09-24 · unverdicted · novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

XWP and XWP_c are novel attribution methods for FCNNs that estimate feature importance by perturbing attached weights to avoid added bias and out-of-distribution issues in occlusion approaches.

Composer Vector: Style-steering Symbolic Music Generation in a Latent Space

cs.SD · 2026-04-03 · unverdicted · novelty 6.0

Composer Vector steers symbolic music generation models in latent space at inference time to control and blend composer styles without retraining.

Feature Identification via the Empirical NTK

cs.LG · 2025-10-01 · unverdicted · novelty 6.0

Eigenanalysis of the empirical NTK surfaces feature directions that align with Fourier features in modular addition networks and grammatical features in Gemma-3-270M, outperforming PCA baselines on activations.

Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

cs.AI · 2025-09-30 · unverdicted · novelty 6.0

Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific

Linear Representations of Sentiment in Large Language Models

cs.LG · 2023-10-23 · unverdicted · novelty 6.0

Sentiment is represented as a single linear direction in LLM activation space that is causally relevant across tasks and is summarized at punctuation and names in addition to charged words.

Towards Effective Theory of LLMs: A Representation Learning Approach

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

Engineering Resource-constrained Software Systems with DNN Components: a Concept-based Pruning Approach

cs.SE · 2026-04-11 · unverdicted · novelty 5.0

A concept-based pruning method for DNNs guided by interpretable concepts and system requirements produces smaller, computationally efficient models that maintain effectiveness on image classification tasks.

"Faithful to What?" On the Limits of Fidelity-Based Explanations

cs.LG · 2025-06-13 · unverdicted · novelty 5.0

High-fidelity surrogate explanations for neural networks often fail to recover the networks' predictive advantages over linear models in regression tasks.

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

cs.LG · 2024-08-09 · accept · novelty 4.0

Gemma Scope supplies trained sparse autoencoders for all layers of Gemma 2 2B and 9B plus select 27B layers, with public weights and benchmark scores.

citing papers explorer

Showing 19 of 19 citing papers.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small · cs.LG · 2022-11-01 · conditional · none · ref 16
Toy Models of Superposition · cs.LG · 2022-09-21 · accept · none · ref 1
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2 · cs.LG · 2026-05-13 · unverdicted · none · ref 13
Data-driven Circuit Discovery for Interpretability of Language Models · cs.AI · 2026-05-09 · unverdicted · none · ref 20
From Mechanistic to Compositional Interpretability · cs.LG · 2026-05-09 · unverdicted · none · ref 204
Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models · cs.LG · 2026-05-06 · unverdicted · none · ref 5
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction · cs.AI · 2026-05-04 · unverdicted · none · ref 17
Improving Dictionary Learning with Gated Sparse Autoencoders · cs.LG · 2024-04-24 · unverdicted · none · ref 41
In-context Learning and Induction Heads · cs.LG · 2022-09-24 · unverdicted · none · ref 19
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute · cs.AI · 2026-05-14 · unverdicted · none · ref 12
From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks · cs.LG · 2026-05-14 · unverdicted · none · ref 19
Composer Vector: Style-steering Symbolic Music Generation in a Latent Space · cs.SD · 2026-04-03 · unverdicted · none · ref 15
Feature Identification via the Empirical NTK · cs.LG · 2025-10-01 · unverdicted · none · ref 10
Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training · cs.AI · 2025-09-30 · unverdicted · none · ref 25
Linear Representations of Sentiment in Large Language Models · cs.LG · 2023-10-23 · unverdicted · none · ref 113
Towards Effective Theory of LLMs: A Representation Learning Approach · cs.LG · 2026-05-10 · unverdicted · none · ref 42
Engineering Resource-constrained Software Systems with DNN Components: a Concept-based Pruning Approach · cs.SE · 2026-04-11 · unverdicted · none · ref 76
"Faithful to What?" On the Limits of Fidelity-Based Explanations · cs.LG · 2025-06-13 · unverdicted · none · ref 11
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 · cs.LG · 2024-08-09 · accept · none · ref 6
