arxiv: 2501.16496 · v1 · submitted 2025-01-27 · 💻 cs.LG


Open Problems in Mechanistic Interpretability

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath


Pith reviewed 2026-05-14 18:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords mechanistic interpretability · open problems · neural networks · AI safety · interpretability · AI assurance · socio-technical challenges


The pith

Mechanistic interpretability must solve open problems in methods, applications, and socio-technical challenges to achieve its goals of AI assurance and scientific insight.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews the current frontier of mechanistic interpretability, a field focused on understanding the computational mechanisms inside neural networks. It highlights that despite recent progress, many open problems block the realization of benefits like greater control over AI systems and answers to questions about intelligence. These problems fall into three categories: conceptual and practical improvements to methods for deeper insights, strategies for applying methods to specific goals, and socio-technical challenges that shape the research. Addressing them is presented as necessary before the field can deliver on its promises. A sympathetic reader would see this as a call to prioritize these issues to advance both engineering safety and basic science.

Core claim

The central claim is that progress toward the goals of mechanistic interpretability—providing assurance over AI behavior and illuminating the nature of intelligence—requires solutions to open problems in three areas: improving methods to reveal deeper insights, determining how to best apply methods for concrete objectives, and tackling socio-technical issues influenced by and influencing the work.

What carries the argument

The three-category framework of open problems (methods improvements, application strategies, socio-technical challenges) that organizes the frontier and identifies priorities for future work.

If this is right

Improved methods will uncover deeper computational mechanisms in neural networks.
Application strategies will guide the use of interpretability toward specific scientific and engineering aims.
Resolving socio-technical challenges will allow the field to navigate influences from society and ethics.
Overall progress will lead to greater assurance over AI system behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prioritizing these problems could lead to the development of standardized evaluation metrics for interpretability techniques.
Connections might form with neuroscience as questions about intelligence are addressed.
Testable extensions include community efforts to solve one problem at a time and measure resulting gains in AI understanding.
The review suggests the field would benefit from collaborative roadmaps based on these open problems.

Load-bearing premise

That solving the identified open problems will directly result in greater assurance over AI system behavior and new insights into the nature of intelligence.

What would settle it

Researchers solve several of the listed open problems in methods and applications yet observe no improvement in their ability to predict or assure specific behaviors in trained neural networks.

Original abstract

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a well-organized review cataloging open problems in mechanistic interpretability without new results or deep analysis.


Hi, this paper mainly compiles and structures existing open problems in mechanistic interpretability rather than solving any or presenting fresh data. The authors, a large group from the field, break things down into method improvements, applications to goals like safety, and socio-technical issues. That organization gives a clear snapshot of the current frontier and pulls from the right literature, which makes it a decent reference for seeing where things are stuck. It does a fair job reflecting consensus views on challenges like scaling techniques or linking them to concrete outcomes. The soft spots are that most problems are restated from prior work with limited new framing or evidence on their relative importance. Sections stay high-level without concrete examples of failures or metrics for progress, so it functions more as motivation than a detailed roadmap. The claim that solving these will deliver assurance over AI behavior and scientific insight is stated but not argued with specifics here. This is for researchers already in interpretability or adjacent areas who need a map of open questions. A reader planning projects or catching up would get some value, though anyone expecting technical advances or resolved issues will not. It deserves peer review as an agenda-setting review that can help direct community effort, even if revisions would tighten the prioritization.

Referee Report

0 major / 2 minor

Summary. The paper is a forward-facing review of mechanistic interpretability that outlines the field's aim to reverse-engineer computational mechanisms in neural networks for concrete scientific goals (insights into intelligence) and engineering goals (assurance over AI behavior). It catalogs open problems across three categories—methods (conceptual and practical improvements needed to reveal deeper insights), applications (how best to apply methods to specific goals), and socio-technical challenges—and argues that solving these is required before the promised benefits can be realized.

Significance. If the identified problems are addressed, the review could meaningfully guide prioritization in the field, accelerating progress toward verifiable assurance in AI systems and scientific understanding of learned representations. Its value lies in the structured, comprehensive catalog of open issues drawn from the current frontier; this agenda-setting function is a strength for a review paper and can help coordinate research efforts without introducing new empirical claims or derivations.

minor comments (2)

[Abstract] The abstract effectively motivates the review but could include one sentence on the paper's organizational structure (e.g., how the three categories of open problems are sequenced) to improve reader navigation.
Some problem descriptions in the methods section would benefit from a short example or citation to a concrete recent paper illustrating the gap, to make the open-problem statements more actionable for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of the manuscript, as well as their recommendation to accept. We appreciate the recognition that the paper's primary contribution is its structured catalog of open problems across methods, applications, and socio-technical challenges, which can help coordinate research efforts in mechanistic interpretability.

Circularity Check

0 steps flagged

No significant circularity; purely descriptive review of open problems


This paper is a forward-facing review that catalogs open problems in mechanistic interpretability without presenting any mathematical derivations, empirical predictions, fitted models, or load-bearing technical claims. Its central statements are motivational framing about the field's potential benefits and the need to address listed challenges in methods, applications, and socio-technical issues. No self-definitional steps, fitted inputs renamed as predictions, or self-citation chains exist that could reduce any result to its own inputs by construction. The document is self-contained as an agenda-setting review with no internal derivation chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a review paper with no new mathematical derivations, fitted parameters, or postulated entities; it relies on standard domain knowledge of the field.

pith-pipeline@v0.9.0 · 5528 in / 957 out tokens · 34968 ms · 2026-05-14T18:25:38.231306+00:00 · methodology


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tracing Persona Vectors Through LLM Pretraining
cs.CL 2026-05 unverdicted novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
WriteSAE: Sparse Autoencoders for Recurrent State
cs.LG 2026-05 unverdicted novelty 8.0

WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
WriteSAE: Sparse Autoencoders for Recurrent State
cs.LG 2026-05 unverdicted novelty 8.0

WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
cs.LG 2026-05 conditional novelty 7.0

fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
From Mechanistic to Compositional Interpretability
cs.LG 2026-05 unverdicted novelty 7.0

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
cs.LG 2026-05 unverdicted novelty 7.0

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
Linear-Readout Floors and Threshold Recovery in Computation in Superposition
cs.LG 2026-05 unverdicted novelty 7.0

Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
cs.AI 2026-05 unverdicted novelty 7.0

Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
Diverse Dictionary Learning
cs.LG 2026-04 unverdicted novelty 7.0

Diverse dictionary learning identifies intersections, complements, and dependency structures of latent variables from data X = g(Z) up to indeterminacies, and full identifiability when structural diversity is sufficient.
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
cs.LG 2026-04 unverdicted novelty 7.0

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
cs.CL 2026-05 unverdicted novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
Bilinear autoencoders find interpretable manifolds
cs.LG 2026-05 unverdicted novelty 6.0

Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation...
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
cs.AI 2026-05 unverdicted novelty 6.0

In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
cs.AI 2026-05 unverdicted novelty 6.0

Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...
Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
Understanding the Mechanism of Altruism in Large Language Models
econ.GN 2026-04 unverdicted novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
cs.CL 2026-04 unverdicted novelty 6.0

Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to 36.8-65% across 31 frontier models, indicating weak generalization in safety refusals.
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
cs.LG 2026-04 unverdicted novelty 6.0

Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL 2026-04 unverdicted novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
Functional Similarity Metric for Neural Networks: Overcoming Parametric Ambiguity via Activation Region Analysis
cs.LG 2026-04 unverdicted novelty 6.0

A functional similarity metric for ReLU networks uses normalized activation region signatures and MinHash to overcome parametric symmetries like neuron permutation and scaling.
Metaphor Is Not All Attention Needs
cs.CL 2026-05 unverdicted novelty 5.0

Poetic jailbreaks succeed because they induce distinct attention patterns in LLMs that are independent of harmful-content detection, not because models fail to recognize literary formatting.
Do Linear Probes Generalize Better in Persona Coordinates?
cs.AI 2026-05 unverdicted novelty 5.0

Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
cs.CL 2026-05 unverdicted novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
High-Dimensional Statistics: Reflections on Progress and Open Problems
math.ST 2026-05 unverdicted novelty 2.0

A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.
There Will Be a Scientific Theory of Deep Learning
stat.ML 2026-04 unverdicted novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 22 Pith papers · 3 internal anchors

[17]

How should we decompose networks into more interpretable constituent parts?
a. What isomorphism or what approximation of a neural network (or parts of it) is the best way to express it for the purposes of interpreting it?
b. How should we coarse grain neural networks?
c. How should we build higher level abstractions on top of low-level network components?

[18]

How true is the linear representation hypothesis?
a. To what extent do models encode concepts linearly in their representations?
b. How should we characterize representations that are not linearly represented in neural networks?
c. What properties of a concept, or of the training distribution, result in a particular concept becoming encoded linearly (or not)?
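One simple way to probe question [18] a. is to check whether a concept can be read off activations with a linear map. The sketch below is illustrative only: the activations are synthetic Gaussian data (an assumption, not real model activations), and the function names `make_activations`, `fit_difference_of_means`, and `probe_accuracy` are hypothetical. It fits a difference-of-means probe, one of several possible linear probes, and measures held-out accuracy.

```python
import random

def make_activations(n, direction, seed):
    """Synthetic stand-in for model activations (an assumption, not real data).

    Each point is unit Gaussian noise shifted by +2 or -2 along `direction`,
    depending on a binary concept label.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + sign * 2.0 * d for d in direction])
        ys.append(label)
    return xs, ys

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_difference_of_means(xs, ys):
    """Linear probe: weights are mean(positive) - mean(negative);
    the bias places the decision boundary at the midpoint of the two means."""
    pos = mean_vector([x for x, y in zip(xs, ys) if y == 1])
    neg = mean_vector([x for x, y in zip(xs, ys) if y == 0])
    w = [p - q for p, q in zip(pos, neg)]
    b = -sum(wi * (p + q) / 2.0 for wi, p, q in zip(w, pos, neg))
    return w, b

def probe_accuracy(w, b, xs, ys):
    """Fraction of points whose sign(w.x + b) matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((score > 0) == (y == 1))
    return hits / len(ys)

# The concept is encoded along the first of eight coordinates.
direction = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
train_x, train_y = make_activations(400, direction, seed=0)
test_x, test_y = make_activations(400, direction, seed=1)
w, b = fit_difference_of_means(train_x, train_y)
held_out_accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {held_out_accuracy:.3f}")
```

High held-out accuracy here only shows that the concept is linearly decodable in this toy setting. It does not show that a network uses that direction, which is the gap questions [33] and [35] point at.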
[19]

Is the combination of the linear representation hypothesis and superposition the right frame for thinking about computation in neural networks?
a. Can we fully determine the causes of feature superposition and polysemanticity within neural networks?
b. How should we understand superposition in attention blocks?
c. How should we understand cross-layer supe...
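As a rough illustration of what superposition in [19] refers to, the sketch below packs three features into a two-dimensional activation space using evenly spaced unit directions. This is a toy construction chosen for the example, not something taken from the paper, and the helper names are hypothetical. Reading a feature back with a dot product picks up interference from the other features whenever more than one is active.

```python
import math

def unit_directions(n_features):
    """n_features evenly spaced unit vectors in a 2-D activation space."""
    return [
        (math.cos(2 * math.pi * k / n_features),
         math.sin(2 * math.pi * k / n_features))
        for k in range(n_features)
    ]

def embed(feature_values, directions):
    """Write features into activation space as a weighted sum of directions."""
    x = sum(v * d[0] for v, d in zip(feature_values, directions))
    y = sum(v * d[1] for v, d in zip(feature_values, directions))
    return (x, y)

def read_out(activation, directions):
    """Linear readout: project the activation onto each feature direction."""
    return [activation[0] * d[0] + activation[1] * d[1] for d in directions]

directions = unit_directions(3)

# One active feature: it is read back at full strength, while the inactive
# features receive negative interference of -0.5 each.
single = read_out(embed([1.0, 0.0, 0.0], directions), directions)

# Two active features: each is read back at only 0.5 of its true value.
double = read_out(embed([1.0, 1.0, 0.0], directions), directions)

print([round(v, 3) for v in single])
print([round(v, 3) for v in double])
```

In the single-feature case a ReLU after the readout would remove the negative interference, which is one intuition for why sparse features are easier to store in superposition. The two-feature case shows where that breaks down.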
[20]

Can the problems with SDL be overcome?
a. What lies in SDL reconstruction errors? Will the errors converge to zero with methodological progress?
b. Is sparsity the correct proxy for interpretability?
c. Can the approach be scaled to the largest models?
d. Does SDL make sense if we don’t believe in the linear representation hypothesis?
e. Is sparsity the b...
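The reconstruction error in [20] a. is often summarized as the fraction of variance left unexplained by the dictionary. The sketch below is a minimal illustration under stated assumptions: the dictionary is a hand-written orthonormal set that omits one input dimension, the encoder is a plain top-k selection, and all names (`encode_top_k`, `decode`, `fraction_variance_unexplained`) are hypothetical rather than any library's API.

```python
def encode_top_k(x, dictionary, k):
    """Project x onto each dictionary direction, keep the k largest by magnitude."""
    acts = [sum(a * b for a, b in zip(x, d)) for d in dictionary]
    order = sorted(range(len(acts)), key=lambda i: abs(acts[i]), reverse=True)
    keep = set(order[:k])
    return [acts[i] if i in keep else 0.0 for i in range(len(acts))]

def decode(codes, dictionary, dim):
    """Reconstruct the input as a weighted sum of dictionary directions."""
    out = [0.0] * dim
    for c, d in zip(codes, dictionary):
        for j in range(dim):
            out[j] += c * d[j]
    return out

def fraction_variance_unexplained(xs, x_hats):
    """Squared reconstruction error divided by total variance around the mean."""
    dim = len(xs[0])
    mean = [sum(x[j] for x in xs) / len(xs) for j in range(dim)]
    resid = sum((a - b) ** 2 for x, xh in zip(xs, x_hats) for a, b in zip(x, xh))
    total = sum((a - m) ** 2 for x in xs for a, m in zip(x, mean))
    return resid / total

# Toy dictionary: covers the first three of four input dimensions only.
dictionary = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
xs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [2.0, 0.0, 0.0, -1.0],
]
x_hats = [decode(encode_top_k(x, dictionary, k=1), dictionary, dim=4) for x in xs]
fvu = fraction_variance_unexplained(xs, x_hats)
print(f"fraction of variance unexplained: {fvu:.4f}")
```

Here the whole residual lives in the fourth dimension, which the dictionary cannot represent. In a real setting the residual is not so easy to attribute, which is what question a. asks about.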
[21]

How important is the geometry of activation space for explaining neural network behavior?
a. How can we identify the underlying functional structure of networks (which defines why activations are located in particular geometric arrangements in activation space)?
b. Must we understand global feature geometry or only local feature geometry in order to ...

[22]

Can we connect theories for how neural networks generalize to interpretability?
a. Can we distinguish parts of networks that underlie generalization from parts that underlie memorization?
b. What mechanisms underlie the relationship between interpretability and generalization?
c. Are there connections between adversarial robustness and superposition?
d. C...

[23]

Can we build intrinsically more interpretable models at low performance cost? How helpful is this?
a. Interpretability training: Can we train networks that are interpretable by default at low performance cost?
b. Interpretable inference: Can we convert already-trained models into forms that are much easier to completely interpret at little performance c...

[24]

Can we improve on max-activating input data set examples for understanding the causes of network component activations?
a. How can we avoid imposing human bias to explanations?
b. Can we progress toward deeper descriptions based on internal mechanisms?
c. How might we develop interpretation methods that can recognize and work with unfamiliar concepts - co...

[25]

How can we develop attribution methods that faithfully and efficiently compute which network components are important for some downstream metric?
a. How can we develop attribution methods that capture higher-order effects beyond first-order approximations of model behavior?
b. Is it possible to create perturbation-based methods that don’t force models to ...
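To make the first-order limitation in [25] a. concrete, the sketch below compares an exact zero-ablation effect with its gradient-times-change estimate on a hand-written metric. The `metric` function is a hypothetical stand-in for a downstream metric, chosen so that one input saturates, and the gradient is taken by central finite differences rather than autodiff to keep the example free of dependencies.

```python
import math

def metric(acts):
    """Hypothetical downstream metric: saturating in acts[0], linear in acts[1]."""
    return math.tanh(3.0 * acts[0]) + 0.5 * acts[1]

def exact_ablation_effect(f, acts, i, baseline=0.0):
    """Change in f when component i is set to the baseline value."""
    ablated = list(acts)
    ablated[i] = baseline
    return f(acts) - f(ablated)

def first_order_effect(f, acts, i, baseline=0.0, eps=1e-5):
    """Gradient at the clean point times the change in component i.

    The gradient is estimated with a central finite difference.
    """
    up = list(acts)
    up[i] += eps
    down = list(acts)
    down[i] -= eps
    grad = (f(up) - f(down)) / (2.0 * eps)
    return grad * (acts[i] - baseline)

acts = [2.0, 1.0]
exact_0 = exact_ablation_effect(metric, acts, 0)
approx_0 = first_order_effect(metric, acts, 0)
exact_1 = exact_ablation_effect(metric, acts, 1)
approx_1 = first_order_effect(metric, acts, 1)

print(f"component 0: exact={exact_0:.4f} first-order={approx_0:.6f}")
print(f"component 1: exact={exact_1:.4f} first-order={approx_1:.4f}")
```

For the linear component the two numbers agree. For the saturated component the local gradient is nearly zero, so the first-order estimate reports almost no importance even though ablating it removes nearly all of its contribution.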
[26]

How can we better measure the downstream effects of model components?
a. How can we reliably distinguish between true causal pathways and compensatory effects like the “Hydra effect” when performing interventions?

A.1.1c Reverse engineering step 3: Validation of descriptions
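The compensation issue raised in [26] a. can be illustrated with a deliberately tiny toy. The two-component "network" below is an assumption made up for this example: a primary component carries the signal, and a backup component fills in whatever the primary fails to supply. Ablating the primary while letting the backup recompute shows no change in output, whereas ablating it with the backup frozen at its clean value reveals the primary's direct contribution.

```python
def run(x, ablate_primary=False, frozen_backup=None):
    """Toy two-component computation (hypothetical, for illustration only).

    primary: passes the input through unless ablated.
    backup:  ReLU of the signal the primary failed to supply, unless frozen.
    Returns (output, backup activation).
    """
    primary = 0.0 if ablate_primary else x
    if frozen_backup is None:
        backup = max(0.0, x - primary)
    else:
        backup = frozen_backup
    return primary + backup, backup

clean_out, clean_backup = run(2.0)

# Total effect: ablate the primary and let the backup respond.
total_out, _ = run(2.0, ablate_primary=True)
total_effect = clean_out - total_out

# Direct effect: ablate the primary but hold the backup at its clean value.
direct_out, _ = run(2.0, ablate_primary=True, frozen_backup=clean_backup)
direct_effect = clean_out - direct_out

print(f"total effect:  {total_effect}")
print(f"direct effect: {direct_effect}")
```

Reporting both quantities side by side is one way to avoid concluding that a component is unimportant just because downstream components compensated for its removal.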
[27]

Can we improve our ability to validate mechanistic explanations for model behavior in ways that do not depend on researcher intuition and are computationally tractable to use?
a. Can we improve on methodologies for evaluating hypotheses through their predictive power on activations of network components?
b. Can we develop methodologies for evaluating h...

[28]

Can we develop “model organisms” as a community, which are understood deeply, and seen as a test-bed for new unproven interpretability methodologies to be tested?

[29]

Can we establish standardized baselines and benchmarks for comparing different interpretability approaches on real-world, non-cherry-picked tasks, where the ground truth is known?

[30]

What would constitute a comprehensive set of “stress tests” for interpretability hypotheses that could reliably detect interpretability illusions?

[31]

How might we design evaluation frameworks that assess interpretability methods on their average case and worst-case performance rather than just best-case scenarios?

[32]

How can we ensure that our understanding of internals generalizes to out-of-distribution inputs?

A.1.2 Concept-based interpretability: Identifying network components for given roles

[33]

How can we reliably distinguish causal from merely correlated features when probing neural networks?

[34]

Can we develop automated systems to generate high-quality probing data sets, reducing the current heavy reliance on human effort?

[35]

What regularization and validation techniques can be used to prevent spurious correlations while ensuring probes find generalizable features?

[36]

How can we improve probing for concepts that may not have clear positive/negative examples?

A.1.3 Proceduralizing mechanistic interpretability into circuit discovery pipelines
[37]

Can we develop techniques that build on lower level methods that provide deeper or more complete insights about neural networks?

work page
[38]

How much can we learn from further work in the existing circuit discovery paradigm? a. Should we expect circuit discovery to benefit from further methodological progress in decom- posing neural networks? Will faithfulness go up and explanation description length go down? b. Can we remove the constraint that task definition for circuit discovery is inheren...

[39]

Can we improve on AI-automated feature description and validation methods?
a. Through automating the generation and testing of arbitrary hypotheses?
b. Through describing differences between features?
c. Through descriptions of how components interact?

[40]

Can we improve on ACDC-like circuit discovery methods?

[41]

Can we automate other parts of the mechanistic interpretability pipeline?
a. Conceptual interpretability research?
b. Decomposition method discovery?
c. More ad hoc validation of hypotheses?

[42]

Should we take steps to mitigate potentially misaligned AI systems sabotaging AI-automated interpretability?

A.2 Open problems in applications of mechanistic interpretability

A.2.1 Using mechanistic interpretability for better monitoring and auditing of AI systems for potentially unsafe cognition

[43]

Can we effectively use interpretability for safety evaluations?
a. Can we develop robust “white box” evaluations that detect concerning internal patterns without needing to understand the entire network?
b. Can we reliably distinguish between features that merely recognize deceptive behavior versus mechanisms that generate deceptive behavior?
c. How can w...

[44]

Can we leverage interpretability to enhance red-teaming and system testing?
a. Can we use interpretability insights to make red-teaming more efficient than current methods?
b. How can we best use feature attribution to help human red-teamers identify problematic input patterns?

[45]

Can we develop effective test-time monitoring systems based on interpretability?
a. Can we get mechanistic anomaly detection to work?
b. Can we create passive monitoring systems based on model internals that effectively flag concerning internal patterns during deployment?
c. Can we develop monitoring systems that work with only feature-level understandi...
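To illustrate what a very simple passive monitor over model internals could look like, here is a hedged sketch. It assumes activations are available as fixed-length vectors and uses a diagonal-covariance Mahalanobis score with a quantile threshold. This is one basic baseline chosen for illustration, not the approach the authors have in mind, and the helper names (`fit_reference`, `anomaly_score`, `fit_threshold`) are hypothetical.

```python
import random


def fit_reference(acts):
    """Per-dimension mean and variance of activations on trusted inputs."""
    n, dim = len(acts), len(acts[0])
    mean = [sum(a[j] for a in acts) / n for j in range(dim)]
    # A small constant keeps the variance strictly positive.
    var = [sum((a[j] - mean[j]) ** 2 for a in acts) / n + 1e-8 for j in range(dim)]
    return mean, var


def anomaly_score(act, mean, var):
    """Sum of squared z-scores (Mahalanobis distance with diagonal covariance)."""
    return sum((x - m) ** 2 / v for x, m, v in zip(act, mean, var))


def fit_threshold(acts, mean, var, quantile=0.99):
    """Score below which roughly `quantile` of the trusted activations fall."""
    scores = sorted(anomaly_score(a, mean, var) for a in acts)
    return scores[min(len(scores) - 1, int(quantile * len(scores)))]


# Synthetic stand-in for activations collected on trusted inputs.
rng = random.Random(0)
trusted = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
mean, var = fit_reference(trusted)
threshold = fit_threshold(trusted, mean, var)

typical = list(mean)  # sits exactly at the reference mean
shifted = [m + 10 * v ** 0.5 for m, v in zip(mean, var)]  # 10 std devs away

typical_flagged = anomaly_score(typical, mean, var) > threshold
shifted_flagged = anomaly_score(shifted, mean, var) > threshold
print(typical_flagged, shifted_flagged)
```

A monitor like this only detects activations that are statistically unusual. Whether unusual activations correspond to concerning mechanisms is exactly what the questions above leave open.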

[46]

Can we improve steering methods through interpretability?
a. How can we make activation steering more precise and reduce its side effects?
b. Can we develop methods to steer entire mechanisms rather than just single features?
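As background for the steering questions above, the sketch below shows a basic form of activation steering: add a scaled direction to an activation vector, where the direction is the difference between mean activations on two contrasting input sets. The toy vectors and the function names `steering_direction` and `steer` are assumptions made for illustration; real steering operates on a model's residual stream or similar internals.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def steering_direction(pos_acts, neg_acts):
    """Unit-norm difference between the mean activations of two contrasting sets."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def steer(act, direction, alpha):
    """Add alpha times the unit direction to a single activation vector."""
    return [a + alpha * d for a, d in zip(act, direction)]


# Toy activations for inputs that do / do not express some behavior.
pos = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
neg = [[-2.0, 0.0, 0.1], [-1.8, 0.0, 0.1]]
direction = steering_direction(pos, neg)

h = [0.5, 1.0, -1.0]
h_steered = steer(h, direction, alpha=3.0)

# The projection onto the steering direction moves by alpha.
shift = dot(h_steered, direction) - dot(h, direction)
print(round(shift, 6))
```

In this toy setting the intervention changes only the component along `direction`. In a real network, downstream nonlinearities can turn that single shift into broader behavioral changes, which is one source of the side effects the question refers to.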

[47]

Can we achieve reliable model unlearning and editing?
a. Will carving the network at its true joints help us improve on model unlearning and editing?
b. Can mechanistic interpretability help us develop better methods for evaluating unlearning efficacy?
c. Can mechanistic interpretability help us determine which classes of model edit are even possible...

[48]

Can we better understand and improve finetuning through interpretability?
a. Can we make finetuning more sample-efficient by targeting specific parameters?
b. Can we develop better tools for analyzing feature-level or mechanism-level differences between model versions?

A.2.3 Using mechanistic interpretability for better predictions about AI systems

[49]

Can we use mechanistic understanding to predict model behavior in novel situations outside the distribution of inputs we have access to?
a. Can we reliably predict when and how jailbreaking or safety bypasses might occur?
b. How can we identify internal signatures that predict specific failure modes like hallucination?
c. Can we develop methods to predic...

[50]

Can we develop formal verification methods for AI systems?
a. Can current toy model verification approaches scale to frontier systems?
b. How much of neural computation can be reduced to verifiable symbolic operations?
c. Can we create formal guarantees about system behavior in complex, non-formalizable environments?
d. What level of mechanistic underst...

[51]

Can we make rigorous claims about model safety?
a. Can we definitively prove the absence of specific dangerous mechanisms?
b. How can we verify claims about model values and goals in a rigorous way?
c. What types of safety claims are possible with current interpretability methods?
d. Can we develop “enumerative safety” approaches that reliably identify al...

[52]

Can we better predict AI capability development through interpretability?
a. Can we identify early signatures that predict emergent capabilities?
b. How do model mechanisms evolve dynamically during training?
c. Can we map the connection between small-scale circuits and large-scale capabilities?
d. How does the loss landscape’s structure relate to capabil...

[53]

Can we understand the relationship between training data and capabilities?
a. How do specific training examples influence the development of model mechanisms?
b. Can we predict model limitations based on training data composition?
c. Can we design training data sets to reliably produce specific desired capabilities?
d. How does data set structure affect t...

[54]

Can we predict latent or maskable capabilities?
a. How can we identify capabilities that could be ‘unlocked’ through prompting or finetuning?
b. Can we detect when finetuning has masked rather than removed capabilities?
c. How do we analyze mechanisms that span multiple timesteps or sequential behaviors?
d. Can we predict which model capabilities are fund...

[55]

Can we use interpretability to make inference more efficient?
a. How can we identify skippable computations without affecting outputs?
b. Can we create more effective distillation methods through mechanistic understanding?
c. How can we optimize model architecture based on component function analysis?
d. Can we identify and optimize critical computational...

[56]

Can we improve training through mechanistic insights?
a. Can we better select training data by understanding example influence?
b. How can we monitor and optimize capability emergence during training?
c. Can we develop more parameter-efficient training methods through component analysis?
d. Can we create better architectures through component understandin...

[57]

Can we instill capabilities directly into networks?
a. Can we design better inductive biases based on mechanistic insights?
b. Is it possible to create modular architectures with swappable components?
c. Can we develop reliable methods for combining model parameters?
d. Is it possible to transfer specific capabilities between models?

A.2.5 Using mechanist...

[58]

Can we leverage AI models for scientific discovery?
a. How can we extract novel patterns and predictors that models have found?
b. Can we make microscope AI techniques accessible to domain experts?
c. How do we validate scientific insights derived from model interpretability?
d. Can we extend microscope AI beyond current simple correlational discoveries?

[59]

Can we develop better knowledge extraction methods?
a. How can we detect when models have found genuinely novel patterns?
b. Can we automate the process of finding scientific insights in model weights?
c. How do we bridge the gap between model features and scientific concepts?
d. Can we make these techniques usable without deep machine learning expertise?...

[60]

Can interpretability methods generalize across architectures?
a. Do current interpretability methods (SDL, circuit analysis) transfer to SSMs? Or, like the transition from CNNs to transformers, are new approaches necessary?
b. Which insights are model-specific versus universal?
c. How can we adapt methods for multimodal models?

[61]

How do different models trained on similar data compare mechanistically?
a. Is the “universality hypothesis” true across models? To what extent do neural networks learn similar features and circuits to each other (and to humans)?
b. Do different architectures learn fundamentally different features?
c. How do mechanisms of particular tasks differ between t...

[62]

Can we future-proof interpretability research?
a. How can we prepare for interpreting novel architectures?
b. Should we focus on architecture-specific or general methods?
c. Can we identify truly fundamental interpretability principles?
d. Will current methods work on future frontier models?

A.2.7 Human computer interaction with model internals

[63]

Can we create interfaces that use mechanistic understanding to enhance human-neural network interaction?
a. How can we visualize model internals in an intuitive way?
b. Can we develop real-time interpretability dashboards?
c. What’s the right balance between simplicity and depth in these interfaces?
d. How do we make complex model mechanisms understandabl...

[64]

Can we develop interpretability tools to help auditors?
a. How can we help auditors find potential failure modes directly?
b. Can we develop tools to detect bias at the mechanism level?
c. What interfaces would make auditing more efficient and thorough?
d. How can we present technical findings to policy makers?

[65]

Can we improve end-user interaction with AI?
a. How can transparency features help users calibrate trust?
b. Can we create intuitive controls based on model mechanisms?
c. Can we create intuitive ways to steer model behavior?

A.2.8 Governance

[66]

Can mechanistic analysis help identify and prevent failures?
a. Can we identify specific mechanisms that caused AI failures?
b. How do we map the causal chain of mechanisms leading to incidents?
c. Can we detect when similar mechanisms are about to activate?
d. Is it possible to isolate and modify failure-causing mechanisms?

[67]

Can we study mechanism patterns related to governance?
a. Can we identify mechanisms responsible for specific dangerous capabilities?
b. How do we detect deceptive or evasive mechanisms?
c. Can we map the mechanisms involved in model decision-making?
d. Is it possible to verify the absence of specific harmful mechanisms?

[68]

Can mechanistic insights verify compliance?
a. How can we trace decision mechanisms to explain model outputs?
b. Can we identify mechanisms that process copyrighted content?
c. Is it possible to detect mechanisms that encode specific knowledge?
d. How do we verify modifications to problematic mechanisms?

A.2.9 Open socio-technical problems in mechanist...

[69]

Can we use a mechanistic understanding to better evaluate AI capabilities?
a. How can we use interpretability to improve capability elicitation?
b. Can we use interpretability to reliably detect when models are strategically underperforming on capabilities evaluations?

[70]

Can we use a mechanistic understanding to improve our ability to forecast when or whether new capabilities will arise ahead of time?

[71]

How can we use interpretability to better estimate the likelihood of different threat models?

[72]

Can we use interpretability to prevent AI incidents?
a. Can we use interpretability to construct reliable test-time monitors to detect AI incidents?
b. Can we reliably prevent similar incidents in the future by using interpretability to design new evaluation tasks on incident scenarios?

[73]

Can interpretability help verify which workloads GPUs are being used for?

[74]

How should interpretability inform copyright law?

[75]

How can mechanistic understanding help resolve copyright challenges in generative AI, particularly regarding the detection and removal of memorized copyrighted works?

A.2.9b Social and philosophical challenges in mechanistic interpretability

[76]

What is interpretability?
a. What are the goals of the field?
b. How should success be graded?
c. Should we treat interpretability as a science or an engineering discipline? What implications does this have for what research should be done?

[77]

How can we mitigate downside risks of interpretability research?
a. How can we communicate the results of our research such that the risk of their misuse is minimized?
