
arxiv: 2309.08600 · v3 · submitted 2023-09-15 · 💻 cs.LG · cs.CL

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Pith reviewed 2026-05-24 06:40 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: sparse autoencoders, polysemanticity, superposition, language models, mechanistic interpretability, monosemantic features, indirect object identification

The pith

Sparse autoencoders recover sets of sparsely activating features from language model activations that are more interpretable and monosemantic than those found by prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that training sparse autoencoders on the internal activations of a language model can extract features that each correspond to a single semantic concept and activate only in limited contexts. This tackles polysemanticity, the tendency of neurons to respond to multiple unrelated ideas at once, which blocks clear explanations of model behavior. If the extracted features are genuine, researchers could read off human-understandable reasons for what a model is computing at any layer. The authors test this on one language model and report that the features score higher on automated interpretability measures than directions found by other decomposition techniques. They further apply the features to trace which specific ones drive counterfactual changes on the indirect object identification task, achieving finer resolution than earlier decompositions.

Core claim

Sparse autoencoders trained to reconstruct language model activations learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches. With these features the authors can identify the ones causally responsible for counterfactual behavior on the indirect object identification task to a finer degree than previous decompositions. The work indicates that superposition can be resolved in language models by a scalable unsupervised method.

What carries the argument

Sparse autoencoders trained to reconstruct internal activations while encouraging sparse feature activations.
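
As a rough illustration of what "reconstruct while encouraging sparsity" can mean, the sketch below encodes one activation vector into non-negative feature coefficients, decodes it back, and scores the result with a squared reconstruction error plus an L1 penalty. The ReLU encoder, the L1 term, the tied encoder/decoder weights in the toy example, and all names (`sae_forward`, `sae_loss`, `l1_coeff`) are assumptions chosen for illustration; this summary does not state the paper's exact objective.

```python
def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec):
    """Encode activation x into feature coefficients c, then reconstruct x_hat."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    c = [max(0.0, p) for p in pre]  # ReLU keeps coefficients non-negative
    # Reconstruction is a weighted sum of decoder rows (one row per feature).
    x_hat = [sum(c[j] * W_dec[j][i] for j in range(len(c))) for i in range(len(x))]
    return c, x_hat


def sae_loss(x, c, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty (assumed form)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in c)
    return recon + l1_coeff * sparsity


# Toy example: a 2-dimensional activation and an overcomplete set of 3 features.
x = [1.0, 0.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = W_enc  # tied weights: an assumption made only for this example
c, x_hat = sae_forward(x, W_enc, b_enc, W_dec)
loss = sae_loss(x, c, x_hat, l1_coeff=0.1)
print(c, x_hat, loss)
```

In this toy case only one of the three features fires, the input is reconstructed exactly, and the whole loss comes from the sparsity term. Raising `l1_coeff` trades reconstruction quality for fewer active features.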

If this is right

  • Superposition in language models can be resolved with a scalable unsupervised method.
  • Mechanistic interpretability work can proceed from a dictionary of monosemantic features rather than polysemantic neurons.
  • Causal responsibility for specific model behaviors can be attributed to individual features at higher resolution than before.
  • Greater model transparency and steerability become feasible once features are isolated this way.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same training procedure could be applied to activations from other model families or tasks to test whether the improvement in monosemanticity generalizes.
  • If the learned features remain stable across different random seeds and training runs, they would provide a more reliable basis for editing model behavior.
  • Combining the sparse autoencoder dictionary with causal tracing methods might allow systematic editing of high-level concepts without side effects on unrelated behaviors.

Load-bearing premise

The features found by the autoencoders correspond to genuine semantic concepts inside the language model rather than training artifacts, and the automated interpretability metrics accurately reflect human-understandable monosemanticity.

What would settle it

A direct test in which the identified features are ablated or scaled during a forward pass on the indirect object identification task and the expected change in output either fails to appear or appears for unrelated features.
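
The edit step of such a test can be sketched as follows, assuming each feature contributes its coefficient times its decoder direction to the activation. The helper `edit_feature` and its parameters are hypothetical; in a real experiment the edited activation would be written back into the model's forward pass and the change in output measured, which is outside this sketch.

```python
def edit_feature(x, c, W_dec, j, scale=0.0):
    """Return activation x with feature j's contribution rescaled.

    scale=0.0 ablates the feature, scale=1.0 leaves x unchanged,
    and scale=2.0 doubles the feature's contribution.
    Assumes feature j contributes c[j] * W_dec[j] to x.
    """
    delta = (scale - 1.0) * c[j]
    return [xi + delta * wi for xi, wi in zip(x, W_dec[j])]


# Toy example: two features with orthogonal decoder directions.
x = [2.0, 1.0]
c = [2.0, 1.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]

ablated = edit_feature(x, c, W_dec, j=0)            # remove feature 0
doubled = edit_feature(x, c, W_dec, j=0, scale=2.0)  # amplify feature 0
print(ablated, doubled)
```

Editing along a single decoder direction leaves the other coordinates untouched here only because the toy directions are orthogonal. With an overcomplete dictionary the directions overlap, which is one reason effects on unrelated features need to be checked.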

Figures

Figures reproduced from arXiv: 2309.08600 by Aidan Ewart, Hoagy Cunningham, Lee Sharkey, Logan Riggs, Robert Huben.

Figure 1: An overview of our method. We a) sample the internal activations of a language model, ...
Figure 2: Average top-and-random autointerpretability score of our learned directions in the residual ...
Figure 3: (Left) Number of features patched vs KL divergence from target, using various residual ...
Figure 4: Histogram of token counts for dictionary feature 556. (Left) For all datapoints that activate ...
Figure 5: Circuit for the closing parenthesis dictionary feature, with human interpretations of each ...
Figure 6: The tradeoff between the average number of features active and the proportion of variance ...
Figure 7: The tradeoff between sparsity and unexplained variance in our reconstruction. Each series ...
Figure 8: Comparison of average interpretability scores across dictionary sizes. All dictionaries were ...
Figure 9: Random-only interpretability scores across each layer, a measure of how well the inter ...
Figure 10: Top-and-random and random-only interpretability scores for across each MLP layer, ...
Figure 11: Histogram of token counts in the neuron basis. Although there are a large fraction of ...
Figure 12: ‘If’ feature in coding contexts
Figure 13: ‘Dis’ token-level feature showing bigrams, such as ‘disCLAIM’, ‘disclosed’, ‘disor ...
Figure 14: Apostrophe feature in “I’ll”-like contexts.
Figure 15: Apostrophe feature in “don’t”-like contexts.
Figure 16: The number of features that are active, defined as activating more than 10 times across ...
Figure 17: Divergence from target output against number of features patched and magnitude of edits ...
Figure 18: Autointerpretation scores across layers for the residual stream, including top-K baselines ...
Original abstract

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that sparse autoencoders applied to language model internal activations recover sparsely activating features that are more interpretable and monosemantic than directions from alternative approaches (with interpretability assessed via automated methods), and that these features enable finer-grained causal attribution of counterfactual behavior on the indirect object identification task than prior decompositions. It concludes that this provides a scalable unsupervised method for resolving superposition.

Significance. If the automated interpretability metrics are shown to track human judgments of monosemanticity and the recovered features are demonstrated to be model-internal concepts rather than training artifacts, the work would be significant as a practical, scalable tool for mechanistic interpretability. The IOI causal attribution result would strengthen the case for using such decompositions in downstream analysis.

major comments (2)
  1. [Abstract] The central comparative claim that SAE features are 'more interpretable and monosemantic than directions identified by alternative approaches' rests entirely on unspecified automated methods. No details are given on the metrics themselves, their correlation with human judgments, or controls to rule out SAE-specific artifacts, making it impossible to evaluate whether the monosemanticity gains are genuine or artifactual. This is load-bearing for both the interpretability claim and the downstream IOI causal attribution result.
  2. [Abstract] The claim that the method 'pinpoint[s] the features that are causally responsible for counterfactual behaviour on the indirect object identification task to a finer degree than previous decompositions' requires quantitative comparison (e.g., effect sizes, ablation results, or error bars) showing improvement over baselines such as the original IOI circuit analysis. The abstract supplies none, leaving the 'finer degree' assertion unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and detailed comments on our manuscript. We address each of the major comments below. We agree that the abstract would benefit from greater specificity on both points and will revise it accordingly while preserving its high-level nature.

Point-by-point responses
  1. Referee: [Abstract] The central comparative claim that SAE features are 'more interpretable and monosemantic than directions identified by alternative approaches' rests entirely on unspecified automated methods. No details are given on the metrics themselves, their correlation with human judgments, or controls to rule out SAE-specific artifacts, making it impossible to evaluate whether the monosemanticity gains are genuine or artifactual. This is load-bearing for both the interpretability claim and the downstream IOI causal attribution result.

    Authors: We agree the abstract does not detail the automated metrics. The manuscript (Sections 3 and 4) specifies the metrics as automated scoring of feature descriptions and activation contexts via a separate language model, with explicit baselines including random directions and PCA. Limited human correlation studies appear in Appendix B. Controls for artifacts are included via comparison to non-SAE decompositions. We will revise the abstract to briefly characterize the metrics and point to these sections. revision: yes

  2. Referee: [Abstract] The claim that the method 'pinpoint[s] the features that are causally responsible for counterfactual behaviour on the indirect object identification task to a finer degree than previous decompositions' requires quantitative comparison (e.g., effect sizes, ablation results, or error bars) showing improvement over baselines such as the original IOI circuit analysis. The abstract supplies none, leaving the 'finer degree' assertion unsupported.

    Authors: The abstract summarizes results whose quantitative details (ablation effect sizes, comparisons to the original IOI circuit, and error bars across runs) are reported in Section 5. We will revise the abstract to include a short quantitative qualifier or reference to the effect-size improvements shown in the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external automated metrics and task-specific causal tests

Full rationale

The paper's core argument is an empirical demonstration that sparse autoencoders trained on language-model activations recover features judged more interpretable than baselines by automated scoring methods, plus finer causal attribution on the IOI task. No equation or claim reduces a reported prediction or uniqueness result to a fitted parameter or self-citation by construction; the reconstruction objective is standard and the interpretability comparison is performed against external baselines using metrics defined outside the fitted values. Self-citations appear only for background (e.g., the IOI task) and are not load-bearing for the central monosemanticity or superposition-resolution claims. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical assumption that sparse autoencoders can disentangle superimposed features; this is treated as a domain assumption rather than derived.

axioms (1)
  • domain assumption: Polysemanticity is caused by superposition in neural networks
    Abstract states this as the hypothesized cause of polysemanticity.

pith-pipeline@v0.9.0 · 5766 in / 1221 out tokens · 27762 ms · 2026-05-24T06:40:45.628354+00:00 · methodology



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  3. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  4. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

  5. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...

  6. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  7. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  8. Slot Machines: How LLMs Keep Track of Multiple Entities

    cs.CL 2026-04 unverdicted novelty 8.0

    LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

  9. KAN: Kolmogorov-Arnold Networks

    cs.LG 2024-04 conditional novelty 8.0

    KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

  10. Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

  11. Markovian Circuit Tracing for Transformer State Dynamic

    cs.LG 2026-05 unverdicted novelty 7.0

    This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching i...

  12. Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

    cs.CV 2026-05 conditional novelty 7.0

    Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.

  13. Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

    cs.LG 2026-05 unverdicted novelty 7.0

    In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight...

  14. Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

    cs.LG 2026-05 unverdicted novelty 7.0

    KAN-SAE applies nonlinear per-feature B-spline activations in sparse autoencoders to discover 72% more alive climate features and interpretable patterns such as European heatwaves and Pacific typhoons in deep learning...

  15. Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

    cs.RO 2026-05 conditional novelty 7.0

    Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.

  16. When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

  17. Lost and Found in Translation: Variational Diagnostics for Neural Codebook Channels

    cs.LG 2026-05 unverdicted novelty 7.0

    Defines the neural codebook channel K_{e→d}(j|i) and proves a Bernoulli-KL bound on encoder-decoder mismatch in VAEs that cannot be recovered from marginal histograms or mutual information.

  18. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 7.0

    WriteSAE factors sparse autoencoder decoder atoms to the native d_k x d_v cache write shape in recurrent models, provides a closed-form logit shift, and demonstrates high success in atom substitution and behavioral ed...

  19. Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

  20. SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

    cs.LG 2026-05 unverdicted novelty 7.0

    SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

  21. Interpreting Reinforcement Learning Agents with Susceptibilities

    cs.LG 2026-05 unverdicted novelty 7.0

    Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

  22. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  23. PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

    cs.LG 2026-05 unverdicted novelty 7.0

    PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.

  24. Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

  25. SMolLM: Small Language Models Learn Small Molecular Grammar

    cs.LG 2026-05 unverdicted novelty 7.0

    A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

  26. Transformers with Selective Access to Early Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

  27. Transformers with Selective Access to Early Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.

  28. How Language Models Process Negation

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

  29. Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

    cs.AI 2026-05 unverdicted novelty 7.0

    A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in ...

  30. Linear-Readout Floors and Threshold Recovery in Computation in Superposition

    cs.LG 2026-05 unverdicted novelty 7.0

    Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...

  31. Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

  32. A Unifying Framework for Unsupervised Concept Extraction

    cs.LG 2026-04 unverdicted novelty 7.0

    A meta-theorem reduces establishing identifiability guarantees for concept extraction methods to the problem of characterizing the intersection of two sets.

  33. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  34. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  35. Diverse Dictionary Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Diverse dictionary learning identifies intersections, complements, and dependency structures of latent variables from data X = g(Z) up to indeterminacies, and full identifiability when structural diversity is sufficient.

  36. Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

    cs.CV 2026-04 unverdicted novelty 7.0

    Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

  37. Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

    cs.CV 2026-04 unverdicted novelty 7.0

    The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model pr...

  38. What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features

    cs.CL 2026-04 unverdicted novelty 7.0

    Effective multilingual reasoning in large models relies on language-specific patterns in reasoning features rather than uniform English-like traces.

  39. MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

    cs.LG 2026-04 conditional novelty 7.0

    Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

  40. How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

    cs.CL 2026-01 unverdicted novelty 7.0

    Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic as...

  41. V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

    cs.CL 2025-09 conditional novelty 7.0

    V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM perf...

  42. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  43. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  44. Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Transcoders decompose MLP layers in Gemma 3-4B-IT to trace visual grounding more effectively than SAEs and predict hallucinations from circuit graph features at AUC 0.68.

  45. Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

    cs.CL 2026-05 unverdicted novelty 6.0

    Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.

  46. Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

    cs.CV 2026-05 unverdicted novelty 6.0

    SAE-FT uses a sparse autoencoder on pre-trained CLIP visual representations to regularize fine-tuning by penalizing changes to semantically meaningful features, aiming for robust performance on ImageNet and distributi...

  47. Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

    cs.AI 2026-05 unverdicted novelty 6.0

    Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

  48. Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology

    cs.LG 2026-05 unverdicted novelty 6.0

    Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.

  49. Why Retrieval-Augmented Generation Fails: A Graph Perspective

    cs.CL 2026-05 unverdicted novelty 6.0

    Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

  50. Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    TopK SAEs applied to EEG transformers extract clinical features, enable concept steering, and identify selectively steerable, entangled, and non-encoded regimes with a spectral decoder for physiological interpretation.

  51. Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse autoencoders on EEG transformers extract clinical features, identify three steering regimes, expose age-pathology entanglements and wrecking-ball failures, and map interventions to frequency spectra.

  52. Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse autoencoders on EEG transformers identify three regimes of clinical concept encoding and reveal entanglements such as age-pathology confounding via a new steering selectivity metric.

  53. Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture

    cs.LO 2026-05 unverdicted novelty 6.0 partial

    Introduces a trust-boundary architecture in Lean 4 with three certificate families and two operators that deliver sorry-free, axiom-audited assurances for LLM pipeline components.

  54. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 po...

  55. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  56. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  57. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  58. Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.

  59. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  60. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 106 Pith papers · 6 internal anchors

  2. [2]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397--2430. PMLR, 2023

  3. [3]

    Language models can explain neurons in language models

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (Date accessed: 14.05.2023), 2023

  4. [4]

    Curve circuits

    Nick Cammarata, Gabriel Goh, Shan Carter, Chelsea Voss, Ludwig Schubert, and Chris Olah. Curve circuits. Distill, 2021. doi:10.23915/distill.00024.006. https://distill.pub/2020/circuits/curve-circuits

  5. [5]

    Towards automated circuit discovery for mechanistic interpretability

    Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997, 2023

  6. [6]

    Adaptively sparse transformers

    Gonçalo M Correia, Vlad Niculae, and André FT Martins. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2174--2184, 2019

  7. [7]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022

  8. [8]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021

  9. [9]

    Softmax linear units

    Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav K...

  10. [10]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022b

  11. [11]

    Privileged bases in the transformer residual stream

    Nelson Elhage, Robert Lasenby, and Chris Olah. Privileged bases in the transformer residual stream, 2023. URL https://transformer-circuits.pub/2023/privileged-basis/index.html. Accessed: 2023-08-07

  12. [12]

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018

  13. [13]

    Cognitron: A self-organizing multilayered neural network

    Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network. Biol. Cybern., 20(3–4): 121–136, Sep 1975. ISSN 0340-1200. doi:10.1007/BF00342633. URL https://doi.org/10.1007/BF00342633

  14. [14]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  15. [15]

    Accelerating convolutional neural networks via activation map compression

    Georgios Georgiadis. Accelerating convolutional neural networks via activation map compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7085--7095, 2019

  16. [16]

    An Overview of Catastrophic AI Risks

    Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001, 2023

  17. [17]

    Elite backprop: Training sparse interpretable neurons

    Theodoros Kasioumis, Joe Townsend, and Hiroya Inakoshi. Elite backprop: Training sparse interpretable neurons. In NeSy, pp. 82--93, 2021

  18. [18]

    Efficient sparse coding algorithms

    Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. Advances in neural information processing systems, 19, 2006

  19. [19]

    Is sparse attention more interpretable?

    Clara Meister, Stefan Lazov, Isabelle Augenstein, and Ryan Cotterell. Is sparse attention more interpretable? In Annual Meeting of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:235293798

  20. [20]

    The Alignment Problem from a Deep Learning Perspective, May 2025

    Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022

  21. [21]

    Zoom in: An introduction to circuits

    Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3): e00024--001, 2020

  22. [22]

    Sparse coding with an overcomplete basis set: A strategy employed by V1?

    Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research, 37(23): 3311--3325, 1997

  23. [23]

    Sparse coding of sensory inputs

    Bruno A Olshausen and David J Field. Sparse coding of sensory inputs. Current opinion in neurobiology, 14(4): 481--487, 2004

  24. [24]

    Analysis of the optimization landscapes for overcomplete representation learning

    Qing Qu, Yuexiang Zhai, Xiao Li, Yuqian Zhang, and Zhihui Zhu. Analysis of the optimization landscapes for overcomplete representation learning. arXiv preprint arXiv:1912.02427, 2019

  25. [25]

    Taking features out of superposition with sparse autoencoders

    Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders, 2023. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition. Accessed: 2023-05-10

  26. [26]

    Investigating gender bias in language models using causal mediation analysis

    Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33: 12388--12401, 2020

  27. [27]

    Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022

  28. [28]

    High-dimensional data analysis with low-dimensional models: Principles, computation, and applications

    John Wright and Yi Ma. High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. Cambridge University Press, 2022

  29. [29]

    Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

    Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021
