Open Problems in Mechanistic Interpretability
Pith reviewed 2026-05-14 18:25 UTC · model grok-4.3
The pith
Mechanistic interpretability must solve open problems in methods, applications, and socio-technical challenges to achieve its goals of AI assurance and scientific insight.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that progress toward the goals of mechanistic interpretability (providing assurance over AI behavior and illuminating the nature of intelligence) requires solutions to open problems in three areas: improving methods so they reveal deeper insights, determining how best to apply those methods to concrete objectives, and tackling socio-technical issues that both shape and are shaped by the work.
What carries the argument
The three-category framework of open problems (methods improvements, application strategies, socio-technical challenges) that organizes the frontier and identifies priorities for future work.
If this is right
- Improved methods will uncover deeper computational mechanisms in neural networks.
- Application strategies will guide the use of interpretability toward specific scientific and engineering aims.
- Resolving socio-technical challenges will allow the field to navigate influences from society and ethics.
- Overall progress will lead to greater assurance over AI system behavior.
Where Pith is reading between the lines
- Prioritizing these problems could lead to the development of standardized evaluation metrics for interpretability techniques.
- Connections might form with neuroscience as questions about intelligence are addressed.
- Testable extensions include community efforts to solve one problem at a time and measure resulting gains in AI understanding.
- The review suggests the field would benefit from collaborative roadmaps based on these open problems.
Load-bearing premise
That solving the identified open problems will directly result in greater assurance over AI system behavior and new insights into the nature of intelligence.
What would settle it
Researchers solve several of the listed open problems in methods and applications yet observe no improvement in their ability to predict or assure specific behaviors in trained neural networks.
Original abstract
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a forward-facing review of mechanistic interpretability that outlines the field's aim to reverse-engineer computational mechanisms in neural networks for concrete scientific goals (insights into intelligence) and engineering goals (assurance over AI behavior). It catalogs open problems across three categories—methods (conceptual and practical improvements needed to reveal deeper insights), applications (how best to apply methods to specific goals), and socio-technical challenges—and argues that solving these is required before the promised benefits can be realized.
Significance. If the identified problems are addressed, the review could meaningfully guide prioritization in the field, accelerating progress toward verifiable assurance in AI systems and scientific understanding of learned representations. Its value lies in the structured, comprehensive catalog of open issues drawn from the current frontier; this agenda-setting function is a strength for a review paper and can help coordinate research efforts without introducing new empirical claims or derivations.
minor comments (2)
- [Abstract] The abstract effectively motivates the review but could include one sentence on the paper's organizational structure (e.g., how the three categories of open problems are sequenced) to improve reader navigation.
- Some problem descriptions in the methods section would benefit from a short example or citation to a concrete recent paper illustrating the gap, to make the open-problem statements more actionable for readers.
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of the manuscript, as well as their recommendation to accept. We appreciate the recognition that the paper's primary contribution is its structured catalog of open problems across methods, applications, and socio-technical challenges, which can help coordinate research efforts in mechanistic interpretability.
Circularity Check
No significant circularity; purely descriptive review of open problems
This paper is a forward-facing review that catalogs open problems in mechanistic interpretability without presenting any mathematical derivations, empirical predictions, fitted models, or load-bearing technical claims. Its central statements are motivational framing about the field's potential benefits and the need to address listed challenges in methods, applications, and socio-technical issues. No self-definitional steps, fitted inputs renamed as predictions, or self-citation chains exist that could reduce any result to its own inputs by construction. The document is self-contained as an agenda-setting review with no internal derivation chain to inspect for circularity.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 25 Pith papers
-
Tracing Persona Vectors Through LLM Pretraining
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
-
From Mechanistic to Compositional Interpretability
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...
-
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
-
Linear-Readout Floors and Threshold Recovery in Computation in Superposition
Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...
-
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
-
Diverse Dictionary Learning
Diverse dictionary learning identifies intersections, complements, and dependency structures of latent variables from data X = g(Z) up to indeterminacies, and full identifiability when structural diversity is sufficient.
-
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
Bilinear autoencoders find interpretable manifolds
Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation...
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to 36.8-65% across 31 frontier models, indicating weak generalization in safety refusals.
-
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
Functional Similarity Metric for Neural Networks: Overcoming Parametric Ambiguity via Activation Region Analysis
A functional similarity metric for ReLU networks uses normalized activation region signatures and MinHash to overcome parametric symmetries like neuron permutation and scaling.
-
Metaphor Is Not All Attention Needs
Poetic jailbreaks succeed because they induce distinct attention patterns in LLMs that are independent of harmful-content detection, not because models fail to recognize literary formatting.
-
Do Linear Probes Generalize Better in Persona Coordinates?
Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
-
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
-
High-Dimensional Statistics: Reflections on Progress and Open Problems
A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.
-
There Will Be a Scientific Theory of Deep Learning
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...
Reference graph
Works this paper leans on
[17]
How should we decompose networks into more interpretable constituent parts? a. What isomorphism or what approximation of a neural network (or parts of it) is the best way to express it for the purposes of interpreting it? b. How should we coarse grain neural networks? c. How should we build higher level abstractions on top of low-level network components?
-
[18]
How true is the linear representation hypothesis? a. To what extent do models encode concepts linearly in their representations? b. How should we characterize representations that are not linearly represented in neural networks? c. What properties of a concept, or of the training distribution, result in a particular concept becoming encoded linearly (or not)?
-
[19]
Is the combination of the linear representation hypothesis and superposition the right frame for thinking about computation in neural networks? a. Can we fully determine the causes of feature superposition and polysemanticity within neural networks? b. How should we understand superposition in attention blocks? c. How should we understand cross-layer supe...
-
[20]
Can the problems with SDL be overcome? a. What lies in SDL reconstruction errors? Will the errors converge to zero with methodological progress? b. Is sparsity the correct proxy for interpretability? c. Can the approach be scaled to the largest models? d. Does SDL make sense if we don’t believe in the linear representation hypothesis? e. Is sparsity the b...
-
[21]
How important is the geometry of activation space for explaining neural network behavior? a. How can we identify the underlying functional structure of networks (which defines why activations are located in particular geometric arrangements in activation space)? b. Must we understand global feature geometry or only local feature geometry in order to ...
-
[22]
Can we connect theories for how neural networks generalize to interpretability? a. Can we distinguish parts of networks that underlie generalization from parts that underlie memorization? b. What mechanisms underlie the relationship between interpretability and generalization? c. Are there connections between adversarial robustness and superposition? d. C...
-
[23]
Can we build intrinsically more interpretable models at low performance cost? How helpful is this? a. Interpretability training: Can we train networks that are interpretable by default at low performance cost? b. Interpretable inference: Can we convert already-trained models into forms that are much easier to completely interpret at little performance c...
-
[24]
Can we improve on max-activating input data set examples for understanding the causes of network component activations? a. How can we avoid imposing human bias to explanations? b. Can we progress toward deeper descriptions based on internal mechanisms? c. How might we develop interpretation methods that can recognize and work with unfamiliar concepts - co...
-
[25]
How can we develop attribution methods that faithfully and efficiently compute which network components are important for some downstream metric? a. How can we develop attribution methods that capture higher-order effects beyond first-order approximations of model behavior? b. Is it possible to create perturbation-based methods that don’t force models to ...
-
[26]
How can we better measure the downstream effects of model components? a. How can we reliably distinguish between true causal pathways and compensatory effects like the “Hydra effect” when performing interventions?
A.1.1c Reverse engineering step 3: Validation of descriptions
-
[27]
Can we improve our ability to validate mechanistic explanations for model behavior in ways that do not depend on researcher intuition and are computationally tractable to use? a. Can we improve on methodologies for evaluating hypotheses through their predictive power on activations of network components? b. Can we develop methodologies for evaluating h...
-
[28]
Can we develop “model organisms” as a community, which are understood deeply, and seen as a test-bed for new unproven interpretability methodologies to be tested?
-
[29]
Can we establish standardized baselines and benchmarks for comparing different interpretability approaches on real-world, non-cherry-picked tasks, where the ground truth is known?
-
[30]
What would constitute a comprehensive set of “stress tests” for interpretability hypotheses that could reliably detect interpretability illusions?
-
[31]
How might we design evaluation frameworks that assess interpretability methods on their average case and worst-case performance rather than just best-case scenarios?
-
[32]
How can we ensure that our understanding of internals generalizes to out-of-distribution inputs?
A.1.2 Concept-based interpretability: Identifying network components for given roles
-
[33]
How can we reliably distinguish causal from merely correlated features when probing neural networks?
-
[34]
Can we develop automated systems to generate high-quality probing data sets, reducing the current heavy reliance on human effort?
-
[35]
What regularization and validation techniques can be used to prevent spurious correlations while ensuring probes find generalizable features?
-
[36]
How can we improve probing for concepts that may not have clear positive/negative examples?
A.1.3 Proceduralizing mechanistic interpretability into circuit discovery pipelines
-
[37]
Can we develop techniques that build on lower level methods that provide deeper or more complete insights about neural networks?
-
[38]
How much can we learn from further work in the existing circuit discovery paradigm? a. Should we expect circuit discovery to benefit from further methodological progress in decomposing neural networks? Will faithfulness go up and explanation description length go down? b. Can we remove the constraint that task definition for circuit discovery is inheren...
-
[39]
Can we improve on AI automated feature description and validation methods? a. Through automating the generation and testing of arbitrary hypotheses? b. Through describing differences between features? c. Through descriptions of how components interact?
-
[40]
Can we improve on ACDC-like circuit discovery methods?
-
[41]
Can we automate other parts of the mechanistic interpretability pipeline? a. Conceptual interpretability research? b. Decomposition method discovery? c. More ad hoc validation of hypotheses?
-
[42]
Should we take steps to mitigate potentially misaligned AI systems sabotaging AI automated interpretability?
A.2 Open problems in applications of mechanistic interpretability
A.2.1 Using mechanistic interpretability for better monitoring and auditing of AI systems for potentially unsafe cognition
-
[43]
Can we effectively use interpretability for safety evaluations? a. Can we develop robust “white box” evaluations that detect concerning internal patterns without needing to understand the entire network? b. Can we reliably distinguish between features that merely recognize deceptive behavior versus mechanisms that generate deceptive behavior? c. How can w...
-
[44]
Can we leverage interpretability to enhance red-teaming and system testing? a. Can we use interpretability insights to make red-teaming more efficient than current methods? b. How can we best use feature attribution to help human red-teamers identify problematic input patterns?
-
[45]
Can we develop effective test-time monitoring systems based on interpretability? a. Can we get mechanistic anomaly detection to work? b. Can we create passive monitoring systems based on model internals that effectively flag concerning internal patterns during deployment? c. Can we develop monitoring systems that work with only feature-level understandi...
-
[46]
Can we improve steering methods through interpretability? a. How can we make activation steering more precise and reduce its side effects? b. Can we develop methods to steer entire mechanisms rather than just single features?
-
[47]
Can we achieve reliable model unlearning and editing? a. Will carving the network at its true joints help us improve on model unlearning and editing? b. Can mechanistic interpretability help us develop better methods for evaluating unlearning efficacy? c. Can mechanistic interpretability help us determine which classes of model edit are even possible...
-
[48]
Can we better understand and improve finetuning through interpretability? a. Can we make finetuning more sample-efficient by targeting specific parameters? b. Can we develop better tools for analyzing feature-level or mechanism-level differences between model versions?
A.2.3 Using mechanistic interpretability for better predictions about AI systems
-
[49]
Can we predict model behavior in novel situations outside of the distribution of inputs we have access to with mechanistic understanding? a. Can we reliably predict when and how jailbreaking or safety bypasses might occur? b. How can we identify internal signatures that predict specific failure modes like hallucination? c. Can we develop methods to predic...
-
[50]
Can we develop formal verification methods for AI systems? a. Can current toy model verification approaches scale to frontier systems? b. How much of neural computation can be reduced to verifiable symbolic operations? c. Can we create formal guarantees about system behavior in complex, non-formalizable environments? d. What level of mechanistic underst...
-
[51]
Can we make rigorous claims about model safety? a. Can we definitively prove the absence of specific dangerous mechanisms? b. How can we verify claims about model values and goals in a rigorous way? c. What types of safety claims are possible with current interpretability methods? d. Can we develop “enumerative safety” approaches that reliably identify al...
-
[52]
Can we better predict AI capability development through interpretability? a. Can we identify early signatures that predict emergent capabilities? b. How do model mechanisms evolve dynamically during training? c. Can we map the connection between small-scale circuits and large-scale capabilities? d. How does the loss landscape’s structure relate to capabil...
-
[53]
Can we understand the relationship between training data and capabilities? a. How do specific training examples influence the development of model mechanisms? b. Can we predict model limitations based on training data composition? c. Can we design training data sets to reliably produce specific desired capabilities? d. How does data set structure affect t...
-
[54]
Can we predict latent or maskable capabilities? a. How can we identify capabilities that could be ‘unlocked’ through prompting or finetuning? b. Can we detect when finetuning has masked rather than removed capabilities? c. How do we analyze mechanisms that span multiple timesteps or sequential behaviors? d. Can we predict which model capabilities are fund...
-
[55]
Can we use interpretability to make inference more efficient? a. How can we identify skippable computations without affecting outputs? b. Can we create more effective distillation methods through mechanistic understanding? c. How can we optimize model architecture based on component function analysis? d. Can we identify and optimize critical computational...
-
[56]
Can we improve training through mechanistic insights? a. Can we better select training data by understanding example influence? b. How can we monitor and optimize capability emergence during training? c. Can we develop more parameter-efficient training methods through component analysis? d. Can we create better architectures through component understandin...
-
[57]
Can we instill capabilities directly into networks? a. Can we design better inductive biases based on mechanistic insights? b. Is it possible to create modular architectures with swappable components? c. Can we develop reliable methods for combining model parameters? d. Is it possible to transfer specific capabilities between models?
A.2.5 Using mechanist...
-
[58]
Can we leverage AI models for scientific discovery? a. How can we extract novel patterns and predictors that models have found? b. Can we make microscope AI techniques accessible to domain experts? c. How do we validate scientific insights derived from model interpretability? d. Can we extend microscope AI beyond current simple correlational discoveries?
-
[59]
Can we develop better knowledge extraction methods? a. How can we detect when models have found genuinely novel patterns? b. Can we automate the process of finding scientific insights in model weights? c. How do we bridge the gap between model features and scientific concepts? d. Can we make these techniques usable without deep machine learning expertise?...
-
[60]
Can interpretability methods generalize across architectures? a. Do current interpretability methods (SDL, circuit analysis) transfer to SSMs? Or, like the transition from CNNs to transformers, are new approaches necessary? b. Which insights are model-specific versus universal? c. How can we adapt methods for multimodal models?
-
[61]
How do different models trained on similar data compare mechanistically? a. Is the “universality hypothesis” true across models? To what extent do neural networks learn similar features and circuits to each other (and to humans)? b. Do different architectures learn fundamentally different features? c. How do mechanisms of particular tasks differ between t...
-
[62]
Can we future-proof interpretability research? a. How can we prepare for interpreting novel architectures? b. Should we focus on architecture-specific or general methods? c. Can we identify truly fundamental interpretability principles? d. Will current methods work on future frontier models?
A.2.7 Human computer interaction with model internals
-
[63]
Can we create interfaces that use mechanistic understanding to enhance human-neural network interaction? a. How can we visualize model internals in an intuitive way? b. Can we develop real-time interpretability dashboards? c. What’s the right balance between simplicity and depth in these interfaces? d. How do we make complex model mechanisms understandabl...
-
[64]
Can we develop interpretability tools to help auditors? a. How can we help auditors find potential failure modes directly? b. Can we develop tools to detect bias at the mechanism level? c. What interfaces would make auditing more efficient and thorough? d. How can we present technical findings to policy makers?
-
[65]
Can we improve end-user interaction with AI? a. How can transparency features help users calibrate trust? b. Can we create intuitive controls based on model mechanisms? c. Can we create intuitive ways to steer model behavior?
A.2.8 Governance
-
[66]
Can mechanistic analysis help identify and prevent failures? a. Can we identify specific mechanisms that caused AI failures? b. How do we map the causal chain of mechanisms leading to incidents? c. Can we detect when similar mechanisms are about to activate? d. Is it possible to isolate and modify failure-causing mechanisms?
-
[67]
Can we study mechanism patterns related to governance? a. Can we identify mechanisms responsible for specific dangerous capabilities? b. How do we detect deceptive or evasive mechanisms? c. Can we map the mechanisms involved in model decision-making? d. Is it possible to verify the absence of specific harmful mechanisms?
-
[68]
Can mechanistic insights verify compliance? a. How can we trace decision mechanisms to explain model outputs? b. Can we identify mechanisms that process copyrighted content? c. Is it possible to detect mechanisms that encode specific knowledge? d. How do we verify modifications to problematic mechanisms?
A.2.9 Open socio-technical problems in mechanist...
-
[69]
Can we use mechanistic understanding to better evaluate AI capabilities? a. How can we use interpretability to improve capability elicitation? b. Can we use interpretability to reliably detect when models are strategically underperforming on capability evaluations?
-
[70]
Can we use mechanistic understanding to improve our ability to forecast, ahead of time, when or whether new capabilities will arise?
-
[71]
How can we use interpretability to better estimate the likelihood of different threat models?
-
[72]
Can we use interpretability to prevent AI incidents? a. Can we use interpretability to construct reliable test-time monitors to detect AI incidents? b. Can we reliably prevent similar incidents in the future by using interpretability to design new evaluation tasks based on incident scenarios?
-
[73]
Can interpretability help verify which workloads GPUs are being used for?
-
[74]
How should interpretability inform copyright law?
-
[75]
How can mechanistic understanding help resolve copyright challenges in generative AI, particularly regarding the detection and removal of memorized copyrighted works?
A.2.9b Social and philosophical challenges in mechanistic interpretability
-
[76]
What is interpretability? a. What are the goals of the field? b. How should success be graded? c. Should we treat interpretability as a science or an engineering discipline? What implications does this have on what research should be done?
-
[77]
How can we mitigate downside risks of interpretability research? a. How can we communicate the results of our research such that the risk of their misuse is minimized?