Mechanistic Interpretability for AI Safety -- A Review

Efstratios Gavves; Leonard Bereska

arxiv: 2404.14082 · v3 · pith:IA2V3ZBSnew · submitted 2024-04-22 · 💻 cs.AI

Pith reviewed 2026-05-22 14:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords mechanistic interpretability · AI safety · neural networks · causal understanding · reverse engineering · value alignment · scalability · model behaviors

The pith

Reverse engineering neural networks into human-understandable algorithms can provide the causal understanding needed to make advanced AI systems safe.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews mechanistic interpretability as a method to understand AI by breaking down how neural networks represent and compute knowledge into features and algorithms. It argues this approach is essential for AI safety because it allows granular, causal insights that can prevent misalignment or unintended behaviors as models grow more powerful. The review covers foundational concepts such as features within activations and hypotheses about their roles, surveys methods for causally dissecting behaviors, and assesses benefits for understanding, control, and alignment alongside risks like capability gains. It identifies challenges in scalability and automation while advocating for clearer concepts, standards, and expansion to domains like vision and reinforcement learning.

Core claim

Mechanistic interpretability involves reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts, providing a granular, causal understanding of model behaviors that is critical for ensuring value alignment and safety in AI systems.

What carries the argument

Mechanistic interpretability, defined as reverse-engineering neural network computations and representations into human-understandable algorithms and concepts.

If this is right

Granular causal understanding enables targeted interventions to align model behaviors with human values.
Dissection of computations supports proactive control to reduce risks of unintended or harmful actions.
Identification of safety-relevant mechanisms helps manage dual-use concerns in interpretability tools.
Expansion to new domains like vision and reinforcement learning extends safety benefits beyond language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If successful at scale, these methods could enable ongoing audits of deployed systems to catch emergent unsafe behaviors before they cause harm.
Success in one architecture might generalize to others, allowing shared safety insights across different model families.
Combining mechanistic insights with other safety techniques could create layered defenses against catastrophic outcomes.

Load-bearing premise

Reverse-engineering neural network computations into human-understandable algorithms and concepts is feasible at scale for complex models and behaviors.

What would settle it

A demonstration that a large language model exhibits a critical safety failure, such as generating harmful outputs from an unknown internal circuit, that cannot be explained or mitigated through any identified mechanistic features or interventions.

Original abstract

Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We examine benefits in understanding, control, alignment, and risks such as capability gains and dual-use concerns. We investigate challenges surrounding scalability, automation, and comprehensive interpretation. We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors and expand to domains such as vision and reinforcement learning. Mechanistic interpretability could help prevent catastrophic outcomes as AI systems become more powerful and inscrutable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This is a solid overview of mechanistic interpretability work but rests its safety claims on an unproven scaling assumption without new evidence.

This review pulls together existing ideas on mechanistic interpretability and its potential role in AI safety. It does not present new experiments, derivations, or techniques. Instead it organizes concepts like features in activations, causal intervention methods, and hypotheses about representations, then links them to safety goals such as alignment and control. The authors also flag dual-use risks and the practical difficulties of applying these tools at scale. That synthesis is the paper's main contribution and it is done clearly enough to serve as an entry point for readers new to the area. The sections on challenges in automation and comprehensive coverage of large models are direct and proportionate to what the cited literature actually shows.

The central safety argument, however, still hinges on the idea that reverse-engineering can be extended to frontier-scale systems. The paper notes the current limits around toy models and circuit enumeration but offers no fresh data or concrete paths showing how the surveyed methods overcome those limits. This leaves the claim that granular causal understanding could prevent catastrophic outcomes as a projection rather than a supported inference. The citation pattern covers the main lines of work without obvious gaps or heavy self-reference.

For someone wanting a structured map of the field or a starting point for thinking about research priorities in interpretability, the paper is worth reading. It is not a breakthrough but it is a useful consolidation. I would send it to peer review so the synthesis can be tightened and the feasibility discussion made more precise.

Referee Report

2 major / 2 minor

Summary. This review synthesizes mechanistic interpretability research, defining it as reverse-engineering neural network computations and representations into human-understandable algorithms and concepts to enable granular causal understanding. It covers foundational ideas such as features in activations and hypotheses on representation/computation, surveys causal dissection methods, evaluates benefits for understanding/control/alignment alongside risks like capability enhancement and dual-use, examines challenges in scalability/automation/comprehensive coverage, and advocates clarifying concepts, setting standards, and scaling techniques to complex models and domains including vision and reinforcement learning. The paper concludes that mechanistic interpretability could help prevent catastrophic outcomes as AI systems grow more powerful and inscrutable.

Significance. If the literature synthesis is accurate, the review offers a structured overview that connects mechanistic interpretability techniques to AI safety goals while acknowledging practical limits. It gives credit to progress on toy models and circuit-level analyses, which provides a foundation for discussing safety applications, though the safety relevance remains framed as a forward-looking possibility rather than a demonstrated outcome.

major comments (2)

[Abstract and Relevance to AI Safety section] Abstract and section assessing relevance to AI safety: the claim that mechanistic interpretability 'could help prevent catastrophic outcomes' rests on the feasibility of scaling reverse-engineering to frontier models, yet the review notes intractability of circuit enumeration for large systems without synthesizing specific evidence or extensions from the cited literature showing how current methods (e.g., activation patching or causal interventions) generalize beyond toy scales.
[Challenges section] Section investigating challenges surrounding scalability, automation, and comprehensive interpretation: the advocacy for 'scaling techniques' and 'standards' is presented as a solution path, but lacks concrete discussion of how automation or partial interpretation approaches address the acknowledged gap between current capabilities on small models and the requirements for 100B+ parameter systems.
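
For readers unfamiliar with the activation patching named in the first major comment, the following is a minimal sketch of the idea on a toy network, not the paper's or any library's implementation. The weights, inputs, and helper names (`hidden`, `output`, `run`) are all hypothetical. The sketch caches hidden activations from a "clean" input, copies them one unit at a time into a run on a "corrupted" input, and reports what fraction of the output gap each patch restores.

```python
# Toy activation patching. All weights and inputs are made up for illustration.

def relu(x):
    return max(0.0, x)

# Hypothetical network: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row of input weights per hidden unit
W2 = [2.0, -1.0, 0.5]                      # one output weight per hidden unit

def hidden(x):
    """Hidden activations: ReLU of each row of W1 applied to x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output(h):
    """Linear readout of the hidden activations."""
    return sum(w * hi for w, hi in zip(W2, h))

def run(x, patch=None):
    """Forward pass; `patch` maps a hidden-unit index to an activation to overwrite."""
    h = hidden(x)
    for idx, value in (patch or {}).items():
        h[idx] = value
    return output(h)

clean_x = [1.0, 0.0]
corrupt_x = [0.0, 1.0]

clean_h = hidden(clean_x)      # cache activations from the clean run
clean_out = run(clean_x)
corrupt_out = run(corrupt_x)

# Patch each hidden unit from the clean run into the corrupted run, one at a time,
# and record the fraction of the clean-vs-corrupted output gap that is restored.
effects = {}
for i in range(len(W1)):
    patched_out = run(corrupt_x, patch={i: clean_h[i]})
    effects[i] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

print(effects)
```

In this toy case the unit whose activation is identical in both runs has zero effect, while the other two units split the gap between them. The referee's scalability concern is that real models have vastly more components to patch than this three-unit example.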

minor comments (2)

[Abstract] The abstract packs multiple topics into a single paragraph; splitting the summary of methods from the safety assessment and advocacy would improve readability.
[Methodologies survey] Some methodological descriptions could include brief pointers to key example papers or figures to help readers trace the surveyed techniques.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our review. We have carefully considered the major comments regarding the strength of claims about AI safety benefits and the concreteness of proposed solutions in the challenges section. Below we respond point by point and indicate revisions that will be incorporated into the next version of the manuscript.

Referee: [Abstract and Relevance to AI Safety section] Abstract and section assessing relevance to AI safety: the claim that mechanistic interpretability 'could help prevent catastrophic outcomes' rests on the feasibility of scaling reverse-engineering to frontier models, yet the review notes intractability of circuit enumeration for large systems without synthesizing specific evidence or extensions from the cited literature showing how current methods (e.g., activation patching or causal interventions) generalize beyond toy scales.

Authors: We agree that the safety relevance is prospective and that the manuscript must more explicitly connect the acknowledged scalability limits to cited evidence of progress. The review already notes the intractability of exhaustive circuit enumeration at frontier scales and frames benefits as forward-looking. To address this comment directly, we will revise the abstract and the relevance section to synthesize specific extensions from the literature, including recent applications of activation patching and causal interventions to models beyond toy scales (e.g., work on larger transformers and multimodal systems). This will clarify the basis for cautious optimism without overstating current capabilities.
revision: yes

Referee: [Challenges section] Section investigating challenges surrounding scalability, automation, and comprehensive interpretation: the advocacy for 'scaling techniques' and 'standards' is presented as a solution path, but lacks concrete discussion of how automation or partial interpretation approaches address the acknowledged gap between current capabilities on small models and the requirements for 100B+ parameter systems.

Authors: The referee correctly identifies an opportunity to make the discussion of solutions more concrete. The manuscript discusses automation and partial interpretation as necessary directions but does not provide sufficient detail on specific techniques or their demonstrated reach. In the revised version we will expand the challenges section with concrete examples drawn from the surveyed literature, including automated circuit discovery pipelines, sparse autoencoder-based feature extraction, and partial interpretation methods that have been applied to models in the 1B–10B parameter range. We will also explicitly discuss the remaining gap to 100B+ systems and the role of standards in guiding future work.
revision: yes
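
The second response mentions sparse autoencoder-based feature extraction. As a rough sketch of what such an objective commonly looks like, the snippet below computes a forward pass and a loss made of reconstruction error plus an L1 sparsity penalty. The function names (`sae_forward`, `sae_loss`), the weights, and the coefficient are hypothetical and hand-picked; nothing here is trained, and it is not taken from the paper.

```python
# Minimal sparse autoencoder forward pass and loss. Hand-picked weights, no training.

def sae_forward(x, W_enc, b_enc, W_dec):
    """Return (feature activations, reconstruction) for one activation vector x."""
    # Encoder: ReLU(W_enc x + b_enc), one feature per row of W_enc.
    feats = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Decoder: each feature j contributes feats[j] times its dictionary vector W_dec[j].
    recon = [
        sum(f * W_dec[j][i] for j, f in enumerate(feats))
        for i in range(len(x))
    ]
    return feats, recon

def sae_loss(x, feats, recon, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on feature activations."""
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in feats)
    return mse + l1_coeff * sparsity

# Hypothetical overcomplete dictionary: 3 features for a 2-dimensional activation.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

x = [1.0, 0.0]
feats, recon = sae_forward(x, W_enc, b_enc, W_dec)
loss = sae_loss(x, feats, recon, l1_coeff=0.1)
print(feats, recon, loss)
```

Here only one of the three features fires and the input is reconstructed exactly, so the loss consists only of the sparsity term. In practice the weights would be learned by minimizing this kind of loss over many recorded activations.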

Circularity Check

0 steps flagged

Review paper presents no derivations or predictions that reduce to inputs

This is a literature review summarizing external concepts, methods, challenges, and safety relevance of mechanistic interpretability without any original equations, fitted parameters, predictions, or first-principles derivations. The abstract and body survey prior work, note scalability issues, and advocate for future standards and scaling, but make no claims that reduce by construction to self-referential definitions or self-citations. The safety relevance is framed as a potential outcome based on reviewed literature rather than a derived result internal to the paper. No load-bearing steps exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review the paper introduces no new free parameters, axioms, or invented entities; it relies on standard concepts from the mechanistic interpretability literature it cites.

pith-pipeline@v0.9.0 · 5670 in / 1033 out tokens · 35635 ms · 2026-05-22T14:17:17.903723+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

Relation between the paper passage and the cited Recognition theorem.

Mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding.
IndisputableMonolith.Foundation.PhiForcing phi_equation unclear

Relation between the paper passage and the cited Recognition theorem.

Superposition hypothesis: neural networks represent more features than they have neurons by encoding features in overlapping combinations of neurons.
IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

Relation between the paper passage and the cited Recognition theorem.

We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors.
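
The superposition hypothesis quoted among the passages above can be illustrated with a small sketch. It assumes three hypothetical feature directions packed into a two-dimensional activation space and a simple projection-plus-ReLU readout; these choices are illustrative and are not taken from the paper.

```python
import math

# Three hypothetical feature directions in a 2-dimensional activation space,
# spaced 120 degrees apart: more features than dimensions ("neurons").
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def encode(features):
    """Activation vector: sum of feature values times their directions."""
    return tuple(
        sum(f * d[axis] for f, d in zip(features, directions))
        for axis in range(2)
    )

def decode(activation):
    """Read each feature by projecting onto its direction; ReLU clips negative interference."""
    return [
        max(0.0, sum(a * c for a, c in zip(activation, d)))
        for d in directions
    ]

sparse = decode(encode([1.0, 0.0, 0.0]))  # one feature active
dense = decode(encode([1.0, 1.0, 0.0]))   # two features active at once
print(sparse)
print(dense)
```

With a single active feature the readout recovers it cleanly, because the overlap with the other two directions is negative and is clipped away. With two features active at once, each is read back at roughly half strength, which shows why this kind of overlapping encoding depends on features being sparse.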

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
cs.LG 2026-05 unverdicted novelty 7.0

GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning
cs.LG 2026-05 unverdicted novelty 7.0

A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity a...
Data-driven Circuit Discovery for Interpretability of Language Models
cs.AI 2026-05 unverdicted novelty 7.0

Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 7.0

SoftSAE introduces a dynamic top-k selection mechanism in sparse autoencoders that learns an input-dependent sparsity level via a differentiable soft top-k operator.
Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks
cs.NE 2026-05 unverdicted novelty 7.0

Multi-hop graph analysis of RNNs reveals temporal information routing and motivates resolvent regularization that outperforms L1 by enforcing pathway-level sparsity aligned with task structure.
Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks
cs.NE 2026-05 unverdicted novelty 7.0

RNN computation is recovered from multi-hop graph pathways, and constraining these pathways via resolvent regularization yields improved temporal sparsity and task performance over standard L1.
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
cs.CR 2026-04 unverdicted novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
Task complexity shapes internal representations and robustness in neural networks
cs.LG 2025-08 unverdicted novelty 7.0

Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-prec...
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
cs.AI 2025-03 conditional novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks
cs.LG 2026-05 unverdicted novelty 6.0

XWP and XWP_c are novel attribution methods for FCNNs that estimate feature importance by perturbing attached weights to avoid added bias and out-of-distribution issues in occlusion approaches.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
cs.CL 2026-05 unverdicted novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 6.0

SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.
Confidence Estimation in Automatic Short Answer Grading with LLMs
cs.CL 2026-04 unverdicted novelty 6.0

A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
cs.LG 2026-04 unverdicted novelty 6.0

Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
When AI reviews science: Can we trust the referee?
cs.AI 2026-04 unverdicted novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
What Physics do Data-Driven MoCap-to-Radar Models Learn?
cs.LG 2026-04 unverdicted novelty 6.0

Data-driven MoCap-to-radar models often fail to learn underlying physics despite low reconstruction error, with temporal attention proving critical for transformers to achieve physical consistency.
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
cs.LG 2026-04 unverdicted novelty 6.0

Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
cs.AI 2026-04 unverdicted novelty 6.0

The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.
On the definition and importance of interpretability in scientific machine learning
cs.LG 2025-05 conditional novelty 6.0

Interpretability in SciML requires mechanistic understanding rather than sparsity, and prior knowledge is often essential for interpretable scientific discovery.
Superposition Yields Robust Neural Scaling
cs.LG 2025-05 conditional novelty 6.0

Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
Confidence Estimation in Automatic Short Answer Grading with LLMs
cs.CL 2026-04 unverdicted novelty 5.0

A hybrid confidence framework for LLM-based automatic short answer grading integrates model-based signals with aleatoric uncertainty from semantic clustering of responses and yields more reliable estimates than single...
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
cs.CL 2026-01 unverdicted novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Do Activation Verbalization Methods Convey Privileged Information?
cs.CL 2025-09 unverdicted novelty 5.0

Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
Mechanistic Interpretability Needs Philosophy
cs.CL 2025-06 unverdicted novelty 4.0

The paper claims that mechanistic interpretability needs philosophy as a partner to clarify concepts, refine methods, and navigate epistemic and ethical complexities in AI systems.
LLM-Safety Evaluations Lack Robustness
cs.CR 2025-03 unverdicted novelty 4.0

LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
Enhancing Adversarial Robustness in Network Intrusion Detection: A Layer-wise Adaptive Regularization Approach
cs.CR 2026-05 unverdicted novelty 3.0

LARAR enhances adversarial robustness in network intrusion detection by using layer-wise adaptive regularization and auxiliary classifiers, achieving 95.01% clean accuracy and improved defense against FGSM, PGD, and t...

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 23 Pith papers

[1]

Understanding intermediate layers using linear classifier probes.ICLR,

30 Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.ICLR,

work page
[2]

An introduction to systems biology: design principles of biological circuits

9, 14 Uri Alon. An introduction to systems biology: design principles of biological circuits. Chapman and Hal- l/CRC, 2019. 21 Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.CoRR, 2024. 9 Aryaman Arora, Dan Jurafsky, and Christopher Potts. Cau...

work page 2019
[3]

On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation

31 Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, July 2015. 2 Nicholas Bai, Rahul Ajay Iyer, Tuomas Oikarinen, and Tsui-Wei Weng. Describe-and-dissect: Interpreting neurons in vi...

work page 2015
[4]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

13, 15 Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big?ACM FAccT, March 2021. 11 Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning t...

work page 2021
[5]

Curve detectors

21 Nick Cammarata, Gabriel Goh, Shan Carter, Ludwig Schubert, Michael Petrov, and Chris Olah. Curve detectors. Distill, June 2020. 10, 15, 21, 30, 32 Nick Cammarata, Gabriel Goh, Shan Carter, Chelsea Voss, Ludwig Schubert, and Chris Olah. Curve circuits. Distill, 2021. 10, 21, 30, 32 StevenCao, VictorSanh, andAlexanderM.Rush. Low-complexityprobingviafindi...

work page 2020
[6]

Going beyond neural network feature similarity: The network feature complexity and its interpretation using category theory.CoRR, November 2023a

25 38 Under review as submission to TMLR Yiting Chen, Zhanpeng Zhou, and Junchi Yan. Going beyond neural network feature similarity: The network feature complexity and its interpretation using category theory.CoRR, November 2023a. 11 Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet. Dynamical versus bayesian phase transitions in a toy...

work page 2017
[7]

Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso

30 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability.NeurIPS, 2023. 18, 19, 22, 25, 30 Ian C. Covert, Scott Lundberg, and Su-In Lee. Explaining by removing: a unified framework for model explanation. J. Mach. Learn. Res., January 2021. 2...

work page 2023
[8]

Challenges with unsupervised llm knowledge discovery.CoRR, 2023

34 Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, and Rohin Shah. Challenges with unsupervised llm knowledge discovery.CoRR, 2023. 14 Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. Causalm: Causal model explanation through counterfactual language models.Computational Linguistics, May 2021. 18 Jiahai Feng an...

work page 2023
[9]

Let’s agree to agree: Neural networks share classification order on real datasets.ICML, 2020

11 Guy Hacohen, Leshem Choshen, and Daphna Weinshall. Let’s agree to agree: Neural networks share classification order on real datasets.ICML, 2020. 11 Michael Hanna, Ollie Liu, and Alexandre Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model.NeurIPS, 2023. 10, 17, 22, 30 Michael Hanna, San...

work page 2020
[10]

A circuit for python docstrings in a 4-layer attention-only transformer.AI Alignment Forum, February 2023

15 Stefan Heimersheim and Jett. A circuit for python docstrings in a 4-layer attention-only transformer.AI Alignment Forum, February 2023. 22 Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors.EMNLP, October

work page 2023
[11]

Self-published, 2023a

9, 17, 22 Dan Hendrycks.Introduction to AI Safety, Ethics, and Society. Self-published, 2023a. 26 Dan Hendrycks. Natural selection favors ais over humans.CoRR, July 2023b. 24 42 Under review as submission to TMLR Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research.CoRR, June 2022. 1, 24, 25 Dan Hendrycks, Nicholas Carlini, John Schulman, and...

work page 2022
[12]

9 John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. NAACL HLT, June 2019. 14 Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations.CoRR, December 2018. 27 Jacob Hilton, Nick Cammarata, Shan C...

work page 2019
[13]

Distributed representations.Carnegie Mellon University, 1984

30 Geoffrey E Hinton. Distributed representations.Carnegie Mellon University, 1984. 27 Marius Hobbhahn. Marius’ alignment agenda, 2022. 30 Marius Hobbhahn and Lawrence Chan. Should we publish mechanistic interpretability research?AI Align- ment Forum, April 2023. 25 Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza R...

work page 1984
[14]

Simulators

30 janus. Simulators. LessWrong, September 2022. 11, 12, 24, 34 Erik Jenner, Adrià Garriga-alonso, and Egor Zverev. A comparison of causal scrubbing, causal abstractions, and related methods.AI Alignment Forum, June 2023. 13, 18, 19 Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, and Stuart Russell. Evidence of learned look-ahead ...

work page 2022
[15]

An interpretability illusion for activation patching of arbitrary subspaces

28, 32 Georg Lange, Alex Makelov, and Neel Nanda. An interpretability illusion for activation patching of arbitrary subspaces. AI Alignment Forum, August 2023. 18 Anna Langedijk, Hosein Mohebbi, Gabriele Sarti, Willem Zuidema, and Jaap Jumelet. Decoderlens: Layer- wise interpretation of encoder-decoder transformers.CoRR, 2023. 15 Edmund Lau, Daniel Murfet...

work page 2023
[16]

Trojan detection in large language models: Insights from the trojan detection challenge.CoRR, April 2024

32 Narek Maloyan, Ekansh Verma, Bulat Nutfullin, and Bislan Ashinov. Trojan detection in large language models: Insights from the trojan detection challenge.CoRR, April 2024. 28 Giovanni Luca Marchetti, Christopher Hillar, Danica Kragic, and Sophia Sanborn. Harmonics of learning: Universal fourier features emerge in invariant networks.CoRR, December 2023....

work page 2024
[17]

Marshall and Jan H

19 Simon C. Marshall and Jan H. Kirchner. Understanding polysemanticity in neural networks through coding theory. CoRR, January 2024. 6, 27, 28 Mantas Mazeika, Andy Zou, Akul Arora, Pavel Pleskov, Dawn Song, Dan Hendrycks, Bo Li, and David Forsyth. How hard is trojan detection in dnns? fooling detectors with evasive trojans.CoRR, September

work page 2024
[18]

Copy suppression: Comprehensively understanding an attention head.CoRR, October 2023

29 Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda. Copy suppression: Comprehensively understanding an attention head.CoRR, October 2023. 10 ThomasMcGrath, AndreiKapishnikov, NenadTomašev, AdamPearce, MartinWattenberg, DemisHassabis, Been Kim, Ulrich Paquet, and Vladimir Kramnik. Acquisition of chess knowledge in alphazero.PNA...

work page 2023
[19]

why should i trust you?

16 Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks.TMLR, August 2023. 1, 13, 25, 26, 27, 28, 29 50 Under review as submission to TMLR Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected ...

work page 2023
[20]

Interpretability creationism.The Gradient, 2023

11 Naomi Saphra. Interpretability creationism.The Gradient, 2023. 21 Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? CoRR, May 2023. 20 Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks.CoRR, July 2023. 6, 26, 27 51 Und...

[21]

Attribution patching outperforms automated circuit discovery

2 Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit discovery. CoRR, October 2023. 18, 19, 22 technicalities and Stag. Shallow review of live agendas in alignment & safety. LessWrong, 2023. 24 Max Tegmark and Steve Omohundro. Provably safe systems: the only path to controllable agi. CoRR, September 2023. 24, 29, 35...

[22]

Bert rediscovers the classical nlp pipeline. ACL, August 2019

15 Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. ACL, August 2019. 14 Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. CoRR, 2022. 21 Hannes Thurnherr and Jérémy Scheurer. Tracrb...

[23]

The clock and the pizza: Two stories in mechanistic explanation of neural networks. CoRR, 2023

27 Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The clock and the pizza: Two stories in mechanistic explanation of neural networks. CoRR, 2023. 22 Roland S. Zimmermann, Judy Borowski, Robert Geirhos, Matthias Bethge, Thomas S. A. Wallis, and Wieland Brendel. How well do feature visualizations support causal understanding of cnn activations? Ne...


[1]

Understanding intermediate layers using linear classifier probes. ICLR,

30 Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. ICLR,


[2]

An introduction to systems biology: design principles of biological circuits

9, 14 Uri Alon. An introduction to systems biology: design principles of biological circuits. Chapman and Hall/CRC, 2019. 21 Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. CoRR, 2024. 9 Aryaman Arora, Dan Jurafsky, and Christopher Potts. Cau...


[3]

On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation

31 Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, July 2015. 2 Nicholas Bai, Rahul Ajay Iyer, Tuomas Oikarinen, and Tsui-Wei Weng. Describe-and-dissect: Interpreting neurons in vi...


[4]

On the dangers of stochastic parrots: Can language models be too big? ACM FAccT, March 2021

13, 15 Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? ACM FAccT, March 2021. 11 Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning t...


[5]

Curve detectors

21 Nick Cammarata, Gabriel Goh, Shan Carter, Ludwig Schubert, Michael Petrov, and Chris Olah. Curve detectors. Distill, June 2020. 10, 15, 21, 30, 32 Nick Cammarata, Gabriel Goh, Shan Carter, Chelsea Voss, Ludwig Schubert, and Chris Olah. Curve circuits. Distill, 2021. 10, 21, 30, 32 Steven Cao, Victor Sanh, and Alexander M. Rush. Low-complexity probing via findi...


[6]

Going beyond neural network feature similarity: The network feature complexity and its interpretation using category theory. CoRR, November 2023a

25 Yiting Chen, Zhanpeng Zhou, and Junchi Yan. Going beyond neural network feature similarity: The network feature complexity and its interpretation using category theory. CoRR, November 2023a. 11 Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet. Dynamical versus bayesian phase transitions in a toy...


[7]

Towards automated circuit discovery for mechanistic interpretability. NeurIPS, 2023

30 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. NeurIPS, 2023. 18, 19, 22, 25, 30 Ian C. Covert, Scott Lundberg, and Su-In Lee. Explaining by removing: a unified framework for model explanation. J. Mach. Learn. Res., January 2021. 2...


[8]

Challenges with unsupervised llm knowledge discovery. CoRR, 2023

34 Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, and Rohin Shah. Challenges with unsupervised llm knowledge discovery. CoRR, 2023. 14 Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. Causalm: Causal model explanation through counterfactual language models. Computational Linguistics, May 2021. 18 Jiahai Feng an...


[9]

Let’s agree to agree: Neural networks share classification order on real datasets. ICML, 2020

11 Guy Hacohen, Leshem Choshen, and Daphna Weinshall. Let’s agree to agree: Neural networks share classification order on real datasets. ICML, 2020. 11 Michael Hanna, Ollie Liu, and Alexandre Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. NeurIPS, 2023. 10, 17, 22, 30 Michael Hanna, San...


[10]

A circuit for python docstrings in a 4-layer attention-only transformer. AI Alignment Forum, February 2023

15 Stefan Heimersheim and Jett. A circuit for python docstrings in a 4-layer attention-only transformer. AI Alignment Forum, February 2023. 22 Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. EMNLP, October


[11]

Introduction to AI Safety, Ethics, and Society. Self-published, 2023a

9, 17, 22 Dan Hendrycks. Introduction to AI Safety, Ethics, and Society. Self-published, 2023a. 26 Dan Hendrycks. Natural selection favors ais over humans. CoRR, July 2023b. 24 Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research. CoRR, June 2022. 1, 24, 25 Dan Hendrycks, Nicholas Carlini, John Schulman, and...


[12]

9 John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. NAACL HLT, June 2019. 14 Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. CoRR, December 2018. 27 Jacob Hilton, Nick Cammarata, Shan C...


[13]

Distributed representations. Carnegie Mellon University, 1984

30 Geoffrey E Hinton. Distributed representations. Carnegie Mellon University, 1984. 27 Marius Hobbhahn. Marius’ alignment agenda, 2022. 30 Marius Hobbhahn and Lawrence Chan. Should we publish mechanistic interpretability research? AI Alignment Forum, April 2023. 25 Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza R...


[14]

Simulators

30 janus. Simulators. LessWrong, September 2022. 11, 12, 24, 34 Erik Jenner, Adrià Garriga-Alonso, and Egor Zverev. A comparison of causal scrubbing, causal abstractions, and related methods. AI Alignment Forum, June 2023. 13, 18, 19 Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, and Stuart Russell. Evidence of learned look-ahead ...


[15]

An interpretability illusion for activation patching of arbitrary subspaces

28, 32 Georg Lange, Alex Makelov, and Neel Nanda. An interpretability illusion for activation patching of arbitrary subspaces. AI Alignment Forum, August 2023. 18 Anna Langedijk, Hosein Mohebbi, Gabriele Sarti, Willem Zuidema, and Jaap Jumelet. Decoderlens: Layer-wise interpretation of encoder-decoder transformers. CoRR, 2023. 15 Edmund Lau, Daniel Murfet...

