pith. machine review for the scientific record.

arxiv: 2605.11161 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Interpretability Can Be Actionable

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords interpretability · actionability · evaluation criteria · deep neural networks · practical impact · explainable AI

The pith

Interpretability research advances practical impact when evaluated on actionability rather than explanation alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that rapid growth in interpretability methods for deep neural networks has not produced corresponding real-world effects. The authors claim the central gap is evaluation criteria rather than new technical approaches. They define actionability as the degree to which explanations support specific decisions and interventions outside research settings. This definition rests on two dimensions: concreteness of the provided insights and their validation through external tests. The work identifies current barriers to impact, names five domains where interpretability can provide unique leverage, and supplies a framework whose criteria align directly with those practical outcomes.

Core claim

Interpretability should be evaluated by actionability, defined along the dimensions of concreteness and validation, so that insights enable concrete decisions and interventions beyond research. Adopting this criterion addresses the barriers that currently prevent translation to practice; the paper identifies five domains of unique leverage and presents a framework with outcomes-oriented assessment.

What carries the argument

Actionability as evaluation criterion, measured along concreteness of insights and validation through real-world tests and interventions.
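
As a reading aid, the two dimensions can be pictured as a simple scoring rubric. The minimal Python sketch below rests on assumptions the paper does not make: the 0-to-1 scale, the 0.5 cutoff, and the field names are all illustrative. It maps a piece of interpretability work onto the quadrants sketched in Figure 3.

```python
from dataclasses import dataclass

@dataclass
class ActionabilityScore:
    """Hypothetical rubric for the paper's two dimensions; the numeric scale is an assumption."""
    concreteness: float  # 0.0-1.0: how specific are the recommended decisions or interventions?
    validation: float    # 0.0-1.0: how far have the insights been tested outside the research setting?

    def quadrant(self, threshold: float = 0.5) -> str:
        """Return the Figure 3 quadrant implied by the two scores."""
        c = "high" if self.concreteness >= threshold else "low"
        v = "high" if self.validation >= threshold else "low"
        return f"{c} concreteness, {v} validation"

# Example: a study proposing a specific retraining intervention that has
# only been checked on benchmark data, not in deployment.
print(ActionabilityScore(concreteness=0.8, validation=0.3).quadrant())
# -> "high concreteness, low validation"
```

In the figure's terms, the high/high quadrant corresponds to precise specifications with validated utility, while the low/low quadrant corresponds to directional insights that motivate future work.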

If this is right

  • Actionability criteria can be applied to five domains where interpretability offers unique leverage for practical outcomes.
  • A framework with evaluation criteria aligned to concrete interventions guides research toward measurable results.
  • Actionability becomes a core objective that complements rather than replaces exploratory interpretability work.
  • Barriers such as lack of links to decisions can be reduced by requiring validation of insights outside research settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Adoption could shift publication incentives toward studies that demonstrate actual interventions in applied settings.
  • Researchers might form closer partnerships with practitioners to test whether insights change deployment choices.
  • The same evaluation lens could be tested on other AI subfields that face similar translation gaps.

Load-bearing premise

Defining and enforcing actionability as an evaluation criterion will overcome barriers to real-world impact without requiring new technical methods or changes in research incentives.

What would settle it

A controlled study that applies actionability criteria to existing interpretability methods yet records no measurable increase in concrete decisions or interventions within any of the five named domains would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.11161 by Anja Reusch, Byron Wallace, Eric Wong, Fazl Barez, Hadas Orgad, Ian Tenney, Isabelle Lee, Marius Mosbach, Mor Geva, Naomi Saphra, Sarah Wiegreffe, Tal Haklay.

Figure 1
Figure 1: Actionability checklist for interpretability research. view at source ↗
Figure 2
Figure 2: Five domains where interpretability offers unique leverage to drive concrete improvements. view at source ↗
Figure 3
Figure 3: Types of actionable interpretability work spanned by the two dimensions of actionability (concreteness and validation, each ranging low to high). Quadrants described in full: high concreteness with high validation, precise specifications with validated utility and robust evidence; low concreteness with low validation, directional insights that motivate future work but lack steps. view at source ↗
original abstract

Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability--the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions--concreteness and validation--and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. This position paper claims that interpretability research for deep neural networks suffers from limited practical impact not because of insufficient methods but due to missing evaluation criteria. It introduces 'actionability' as the primary criterion, defined along two dimensions of concreteness and validation. The paper analyzes current barriers to impact, identifies five domains where interpretability can offer unique leverage, and presents a framework for actionable interpretability with evaluation criteria aligned to practical outcomes. The goal is to promote actionability as a core objective alongside exploratory research.

Significance. The argument, if persuasive, has the potential to refocus the interpretability community on research that produces tangible decisions and interventions in real-world settings such as model deployment in critical systems. Credit is due for the structured approach to defining actionability and for specifying domains and a framework without proposing entirely new technical tools. This normative contribution could help bridge the gap between technical interpretability advances and their application, though its ultimate significance will depend on community uptake and further development of the ideas.

major comments (2)
  1. [Barriers analysis] The analysis of barriers does not include discussion of how actionability criteria would be enforced or incentivized within the research ecosystem, for example through changes to review processes or funding calls. This omission makes the assertion that evaluation criteria are the central missing ingredient difficult to evaluate, as the link to overcoming barriers is not fully articulated.
  2. [Framework for actionable interpretability] The proposed framework across five domains provides evaluation criteria but lacks concrete examples or case studies showing how actionability would lead to interventions beyond interpretability research. This weakens the central claim that such criteria will enable concrete decisions, as the effectiveness remains asserted rather than demonstrated even conceptually.
minor comments (2)
  1. The title 'Interpretability Can Be Actionable' is somewhat vague; a subtitle clarifying the focus on evaluation criteria would help.
  2. [Abstract] The abstract mentions 'five domains' but does not name them, which could be included to better orient the reader.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of our position paper to refocus interpretability research. We address each major comment below with clarifications and indicate where revisions have been made to strengthen the manuscript.

point-by-point responses
  1. Referee: The analysis of barriers does not include discussion of how actionability criteria would be enforced or incentivized within the research ecosystem, for example through changes to review processes or funding calls. This omission makes the assertion that evaluation criteria are the central missing ingredient difficult to evaluate, as the link to overcoming barriers is not fully articulated.

    Authors: We agree that explicitly linking evaluation criteria to enforcement mechanisms would better articulate how actionability addresses barriers. The manuscript treats the absence of criteria as the foundational gap, positing that clear standards are a prerequisite for aligned incentives. To address the comment, we have added a concise paragraph to the barriers analysis section outlining potential pathways, such as incorporating actionability into conference review rubrics and funding priorities. This revision makes the connection more explicit while preserving the paper's focus on defining the criteria rather than prescribing ecosystem-wide policy changes. revision: partial

  2. Referee: The proposed framework across five domains provides evaluation criteria but lacks concrete examples or case studies showing how actionability would lead to interventions beyond interpretability research. This weakens the central claim that such criteria will enable concrete decisions, as the effectiveness remains asserted rather than demonstrated even conceptually.

    Authors: We acknowledge that illustrative examples would help demonstrate the framework's utility. As a position paper, the core contribution is the conceptual framework and aligned criteria rather than empirical case studies. To respond to this point, we have incorporated two brief hypothetical scenarios into the framework section—one in healthcare model auditing and one in autonomous vehicle safety—showing how the dimensions of concreteness and validation can guide specific interventions such as targeted retraining or policy adjustments. These additions illustrate the intended pathway from insights to decisions at a conceptual level. revision: yes

Circularity Check

0 steps flagged

No circularity: normative position paper with independent argument

full rationale

The paper is a position piece advancing a normative claim that interpretability research should prioritize actionability (defined via concreteness and validation) over new methods. No mathematical derivations, fitted parameters, predictions, or equations appear in the provided text. The central argument analyzes barriers and proposes a framework across five domains without reducing any step to self-definition, self-citation chains, or renaming of prior results by construction. Self-citations, if present, are not load-bearing for the core thesis, which stands as an independent recommendation rather than a derived quantity. This matches the default expectation for non-technical position papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The position rests on the domain assumption that current interpretability outputs rarely produce actionable interventions, plus the ad-hoc framing of actionability itself as the missing criterion. No free parameters appear; the single invented entity is the proposed framework, a conceptual construct rather than a physical one.

axioms (2)
  • domain assumption: Interpretability aims to explain the behavior of deep neural networks but has not translated into practical impact.
    Stated in the opening of the abstract as the motivating premise.
  • ad hoc to paper: Actionability can be defined along concreteness and validation dimensions and will address the identified barriers.
    Introduced as the proposed solution without prior empirical grounding in the abstract.
invented entities (1)
  • Actionable interpretability framework · no independent evidence
    purpose: To align interpretability research with practical outcomes via evaluation criteria.
    Newly proposed construct that organizes the five domains and evaluation criteria.

pith-pipeline@v0.9.0 · 5468 in / 1381 out tokens · 60711 ms · 2026-05-13T06:06:16.767339+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

184 extracted references · 184 canonical work pages · 7 internal anchors
