pith. machine review for the scientific record.

arxiv: 2605.11161 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Interpretability Can Be Actionable

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords interpretability · actionability · evaluation criteria · deep neural networks · practical impact · explainable AI

The pith

Interpretability research advances practical impact when evaluated on actionability rather than explanation alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that rapid growth in interpretability methods for deep neural networks has not produced corresponding real-world effects. The authors claim the central gap is evaluation criteria rather than new technical approaches. They define actionability as the degree to which explanations support specific decisions and interventions outside research settings. This definition rests on two dimensions: concreteness of the provided insights and their validation through external tests. The work identifies current barriers to impact, names five domains where interpretability can provide unique leverage, and supplies a framework whose criteria align directly with those practical outcomes.

Core claim

Interpretability should be evaluated by actionability, defined along the dimensions of concreteness and validation, so that insights enable concrete decisions and interventions beyond research. Adopting this criterion addresses the barriers that currently prevent translation to practice; the paper identifies five domains of unique leverage and presents a framework with outcomes-oriented assessment.

What carries the argument

Actionability as evaluation criterion, measured along concreteness of insights and validation through real-world tests and interventions.
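
As a reading aid, the two dimensions can be pictured as a simple scoring rubric. The minimal Python sketch below rests on assumptions the paper does not make: the 0-to-1 scale, the 0.5 cutoff, and the field names are all illustrative. It maps a piece of interpretability work onto the quadrants sketched in Figure 3.

```python
from dataclasses import dataclass

@dataclass
class ActionabilityScore:
    """Hypothetical rubric for the paper's two dimensions; the numeric scale is an assumption."""
    concreteness: float  # 0.0-1.0: how specific are the recommended decisions or interventions?
    validation: float    # 0.0-1.0: how far have the insights been tested outside the research setting?

    def quadrant(self, threshold: float = 0.5) -> str:
        """Return the Figure 3 quadrant implied by the two scores."""
        c = "high" if self.concreteness >= threshold else "low"
        v = "high" if self.validation >= threshold else "low"
        return f"{c} concreteness, {v} validation"

# Example: a study proposing a specific retraining intervention that has
# only been checked on benchmark data, not in deployment.
print(ActionabilityScore(concreteness=0.8, validation=0.3).quadrant())
# -> "high concreteness, low validation"
```

In the figure's terms, the high/high quadrant corresponds to precise specifications with validated utility, while the low/low quadrant corresponds to directional insights that motivate future work.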

If this is right

  • Actionability criteria can be applied to five domains where interpretability offers unique leverage for practical outcomes.
  • A framework with evaluation criteria aligned to concrete interventions guides research toward measurable results.
  • Actionability becomes a core objective that complements rather than replaces exploratory interpretability work.
  • Barriers such as lack of links to decisions can be reduced by requiring validation of insights outside research settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Adoption could shift publication incentives toward studies that demonstrate actual interventions in applied settings.
  • Researchers might form closer partnerships with practitioners to test whether insights change deployment choices.
  • The same evaluation lens could be tested on other AI subfields that face similar translation gaps.

Load-bearing premise

Defining and enforcing actionability as an evaluation criterion will overcome barriers to real-world impact without requiring new technical methods or changes in research incentives.

What would settle it

A controlled study that applies actionability criteria to existing interpretability methods yet records no measurable increase in concrete decisions or interventions within any of the five named domains would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.11161 by Anja Reusch, Byron Wallace, Eric Wong, Fazl Barez, Hadas Orgad, Ian Tenney, Isabelle Lee, Marius Mosbach, Mor Geva, Naomi Saphra, Sarah Wiegreffe, Tal Haklay.

Figure 1
Figure 1: Actionability checklist for interpretability research. view at source ↗
Figure 2
Figure 2: Five domains where interpretability offers unique leverage to drive concrete improvements. view at source ↗
Figure 3
Figure 3: Types of actionable interpretability work spanned by the two dimensions of actionability (concreteness and validation, each ranging low to high). Quadrants described in full: high concreteness with high validation, precise specifications with validated utility and robust evidence; low concreteness with low validation, directional insights that motivate future work but lack steps. view at source ↗
original abstract

Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability--the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions--concreteness and validation--and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. This position paper claims that interpretability research for deep neural networks suffers from limited practical impact not because of insufficient methods but due to missing evaluation criteria. It introduces 'actionability' as the primary criterion, defined along two dimensions of concreteness and validation. The paper analyzes current barriers to impact, identifies five domains where interpretability can offer unique leverage, and presents a framework for actionable interpretability with evaluation criteria aligned to practical outcomes. The goal is to promote actionability as a core objective alongside exploratory research.

Significance. The argument, if persuasive, has the potential to refocus the interpretability community on research that produces tangible decisions and interventions in real-world settings such as model deployment in critical systems. Credit is due for the structured approach to defining actionability and for specifying domains and a framework without proposing entirely new technical tools. This normative contribution could help bridge the gap between technical interpretability advances and their application, though its ultimate significance will depend on community uptake and further development of the ideas.

major comments (2)
  1. [Barriers analysis] The analysis of barriers does not include discussion of how actionability criteria would be enforced or incentivized within the research ecosystem, for example through changes to review processes or funding calls. This omission makes the assertion that evaluation criteria are the central missing ingredient difficult to evaluate, as the link to overcoming barriers is not fully articulated.
  2. [Framework for actionable interpretability] The proposed framework across five domains provides evaluation criteria but lacks concrete examples or case studies showing how actionability would lead to interventions beyond interpretability research. This weakens the central claim that such criteria will enable concrete decisions, as the effectiveness remains asserted rather than demonstrated even conceptually.
minor comments (2)
  1. The title 'Interpretability Can Be Actionable' is somewhat vague; a subtitle clarifying the focus on evaluation criteria would help.
  2. [Abstract] The abstract mentions 'five domains' but does not name them, which could be included to better orient the reader.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of our position paper to refocus interpretability research. We address each major comment below with clarifications and indicate where revisions have been made to strengthen the manuscript.

point-by-point responses
  1. Referee: The analysis of barriers does not include discussion of how actionability criteria would be enforced or incentivized within the research ecosystem, for example through changes to review processes or funding calls. This omission makes the assertion that evaluation criteria are the central missing ingredient difficult to evaluate, as the link to overcoming barriers is not fully articulated.

    Authors: We agree that explicitly linking evaluation criteria to enforcement mechanisms would better articulate how actionability addresses barriers. The manuscript treats the absence of criteria as the foundational gap, positing that clear standards are a prerequisite for aligned incentives. To address the comment, we have added a concise paragraph to the barriers analysis section outlining potential pathways, such as incorporating actionability into conference review rubrics and funding priorities. This revision makes the connection more explicit while preserving the paper's focus on defining the criteria rather than prescribing ecosystem-wide policy changes. revision: partial

  2. Referee: The proposed framework across five domains provides evaluation criteria but lacks concrete examples or case studies showing how actionability would lead to interventions beyond interpretability research. This weakens the central claim that such criteria will enable concrete decisions, as the effectiveness remains asserted rather than demonstrated even conceptually.

    Authors: We acknowledge that illustrative examples would help demonstrate the framework's utility. As a position paper, the core contribution is the conceptual framework and aligned criteria rather than empirical case studies. To respond to this point, we have incorporated two brief hypothetical scenarios into the framework section—one in healthcare model auditing and one in autonomous vehicle safety—showing how the dimensions of concreteness and validation can guide specific interventions such as targeted retraining or policy adjustments. These additions illustrate the intended pathway from insights to decisions at a conceptual level. revision: yes

Circularity Check

0 steps flagged

No circularity: normative position paper with independent argument

full rationale

The paper is a position piece advancing a normative claim that interpretability research should prioritize actionability (defined via concreteness and validation) over new methods. No mathematical derivations, fitted parameters, predictions, or equations appear in the provided text. The central argument analyzes barriers and proposes a framework across five domains without reducing any step to self-definition, self-citation chains, or renaming of prior results by construction. Self-citations, if present, are not load-bearing for the core thesis, which stands as an independent recommendation rather than a derived quantity. This matches the default expectation for non-technical position papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The position rests on the domain assumption that current interpretability outputs rarely produce actionable interventions, plus the ad-hoc framing of actionability itself as the missing criterion. No free parameters appear; the single invented entity is the proposed framework, a conceptual construct rather than a physical one.

axioms (2)
  • domain assumption: Interpretability aims to explain the behavior of deep neural networks but has not translated into practical impact.
    Stated in the opening of the abstract as the motivating premise.
  • ad hoc to paper: Actionability can be defined along concreteness and validation dimensions and will address the identified barriers.
    Introduced as the proposed solution without prior empirical grounding in the abstract.
invented entities (1)
  • Actionable interpretability framework · no independent evidence
    purpose: To align interpretability research with practical outcomes via evaluation criteria.
    Newly proposed construct that organizes the five domains and evaluation criteria.

pith-pipeline@v0.9.0 · 5468 in / 1381 out tokens · 60711 ms · 2026-05-13T06:06:16.767339+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

184 extracted references · 184 canonical work pages · 7 internal anchors
