
arXiv: 2605.08012 · v1 · submitted 2026-05-08 · cs.LG · cs.AI · cs.CL

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

Pith reviewed 2026-05-11 02:39 UTC · model grok-4.3

Keywords: mechanistic interpretability, causal claims, identification assumptions, audit, disclosure norm, validation metrics, neural network internals

The pith

Causal claims about neural network internals require explicit identification assumptions in mechanistic interpretability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When papers in mechanistic interpretability describe specific model components as causing certain behaviors, those descriptions depend on assumptions that turn their measurements into evidence of causation. An audit of ten papers spanning four common methods found that none provided a dedicated section stating these assumptions. Instead, metrics such as faithfulness, completeness, monosemanticity, and ablation effects were presented as sufficient to support the causal claims. The authors recommend a disclosure practice in which researchers must state whether their claim is causal, name the identification strategy used, list the assumptions, highlight at least one key assumption, and explain how the conclusions would change if any assumption does not hold. This approach separates the role of validation from the requirements of causal identification.

Core claim

The central claim is that causal vocabulary in mechanistic interpretability, including references to circuits, mediators, and causal abstraction, necessitates the disclosure of identification assumptions. A purposive audit across ten papers in four methodological strands shows the absence of any dedicated identification-assumptions section, with validation metrics routinely substituted for explicit causal identification. The proposed norm requires papers to declare the claim as causal, identify the strategy, enumerate assumptions, stress at least one, and discuss sensitivity to assumption failure.
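The five parts of the proposed norm can be read as a checklist. The following is a minimal sketch of that checklist as a data record, not a template from the paper: the class name, field names, and the example strategy and assumption strings are all hypothetical, and treating the remaining parts as not applicable to non-causal claims is an assumption made here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IdentificationDisclosure:
    """Hypothetical record mirroring the five parts of the proposed norm."""
    claim_is_causal: bool                                          # part 1
    strategy: str = ""                                             # part 2
    assumptions: List[str] = field(default_factory=list)           # part 3
    stressed_assumptions: List[str] = field(default_factory=list)  # part 4
    sensitivity: str = ""                                          # part 5

    def missing_parts(self) -> List[str]:
        """Return the norm parts that a causal claim has not yet filled in."""
        if not self.claim_is_causal:
            # Assumption of this sketch: parts 2-5 only apply to causal claims.
            return []
        missing = []
        if not self.strategy.strip():
            missing.append("strategy")
        if not self.assumptions:
            missing.append("assumptions")
        # At least one stressed assumption must be among those enumerated.
        if not any(a in self.assumptions for a in self.stressed_assumptions):
            missing.append("stressed_assumption")
        if not self.sensitivity.strip():
            missing.append("sensitivity")
        return missing


# Illustrative values only; not drawn from any audited paper.
example = IdentificationDisclosure(
    claim_is_causal=True,
    strategy="activation patching (interventional)",
    assumptions=["patched activations stay on-distribution"],
)
print(example.missing_parts())  # ['stressed_assumption', 'sensitivity']
```

A record like this only checks that each part is present; whether the stated assumptions are plausible remains a judgment for readers and reviewers.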

What carries the argument

Identification assumptions that link validation metrics to causal conclusions about model structures.
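One way to make this link concrete is in standard causal-inference notation. This is a sketch in generic notation, not a formalization taken from the paper; the symbols $Z$ (an internal component), $Y$ (a behavior), $P$ (the distribution of recorded runs), $h$, $g$, and $\mathcal{A}$ are introduced here for illustration.

```latex
\begin{align*}
\tau &= \mathbb{E}[\,Y \mid \mathrm{do}(Z = z)\,] - \mathbb{E}[\,Y \mid \mathrm{do}(Z = z')\,]
  && \text{(causal estimand)} \\
m &= h(P)
  && \text{(validation metric, a statistic of the recorded runs)} \\
\mathcal{A} &\;\Longrightarrow\; \tau = g(P)
  && \text{(identification under assumptions } \mathcal{A}\text{)}
\end{align*}
```

Read this way, a reported value of $m$ bears on $\tau$ only through $\mathcal{A}$, which is why leaving $\mathcal{A}$ unstated leaves the causal reading of the metric unsupported.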

If this is right

  • Causal conclusions in the field would become easier to evaluate for their dependence on unstated premises.
  • Validation metrics alone would no longer be accepted as direct support for causal relations.
  • Papers would need to address how results change when key assumptions are violated.
  • The distinction between correlation captured by metrics and causation would be made explicit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting this norm might lead to more cautious interpretations of interpretability findings in downstream applications.
  • The same requirement could be extended to causal claims in other subfields of machine learning that rely on internal model analysis.

Load-bearing premise

The purposive selection of ten papers and the two-coder audit of thirty items are sufficient to demonstrate a recurring pattern of missing identification-assumption disclosure across mechanistic interpretability research.

What would settle it

A larger or random sample of papers that includes several with explicit sections on identification assumptions for causal claims would undermine the reported pattern.

Figures

Figures reproduced from arXiv: 2605.08012 by Fengming Liu, Zezheng Lin.

Figure 1: Validation metrics and identification assumptions are not interchangeable. Validation …
Figure 2: n = 30 two-human-coder audit (human inter-rater reliability). Left: Dim D under strict adjudication, by methodological strand. Sparse autoencoders (6/6) and causal abstraction (3/3) substitute validation for identification at full rate; activation patching shows the lowest rate (2/7), reflecting partial protective effect of the illusion-and-failure-modes literature within that strand. Right: Dim D = Yes oc…
Original abstract

Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that mechanistic interpretability research increasingly uses causal vocabulary (circuits, mediators, causal abstraction) but fails to disclose the identification assumptions needed to support such claims. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions sections and a pattern of substituting validation metrics (faithfulness, completeness, monosemanticity, alignment, ablation effects) for causal evidence. A two-coder audit on n=30 reproduces the directional finding, though counts are coding-rule sensitive. The authors propose a five-part disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions change if assumptions fail.

Significance. If the pattern documented in the audit generalizes, the paper's call for explicit identification disclosure could meaningfully improve rigor in mechanistic interpretability by clarifying the boundary between correlational and causal claims. The constructive, actionable disclosure norm is a particular strength that provides a concrete template the community could adopt without requiring new methods.

major comments (2)
  1. [Audit description] Audit of 10 papers (abstract and main audit section): the purposive selection of only 10 papers across four strands, combined with the coding-sensitive nature of the n=30 check, is insufficient to establish a field-wide pattern. Without explicit inclusion/exclusion criteria and full details on how papers were chosen, the evidence cannot rule out selection bias, weakening support for the general recommendation.
  2. [Audit results] Validation-metric substitution pattern (abstract and results): the claim that metrics such as faithfulness or monosemanticity are reported as causal support rests on checking for dedicated sections. The audit does not appear to verify whether assumptions are discussed elsewhere in the papers (methods, discussion, or supplements), so absence of a dedicated section does not demonstrate that assumptions are unstated or unconsidered.
minor comments (1)
  1. [Proposed norm] The five-part disclosure norm is clearly stated but would be strengthened by a short worked example showing how it applies to a concrete claim (e.g., a circuit or mediator finding).

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback highlights important methodological considerations for the audit that supports our position. We address each major comment below and outline specific revisions to improve transparency and rigor without altering the core argument for a disclosure norm.

Point-by-point responses
  1. Referee: Audit of 10 papers (abstract and main audit section): the purposive selection of only 10 papers across four strands, combined with the coding-sensitive nature of the n=30 check, is insufficient to establish a field-wide pattern. Without explicit inclusion/exclusion criteria and full details on how papers were chosen, the evidence cannot rule out selection bias, weakening support for the general recommendation.

    Authors: We agree that the audit is purposive rather than a systematic or representative survey and does not establish field-wide prevalence. Its purpose is to document the issue across prominent examples from four methodological strands, motivating the proposed norm. The n=30 check is a limited robustness check whose counts are indeed sensitive to coding rules, as already noted in the manuscript. To address selection bias concerns, we will revise the manuscript to include: (1) explicit inclusion/exclusion criteria for the 10 papers, (2) a step-by-step description of the selection process, and (3) the full coding protocol and inter-coder agreement details for the n=30 audit in an expanded appendix. These additions will allow readers to evaluate the audit's scope and limitations directly.

    Revision: yes

  2. Referee: Validation-metric substitution pattern (abstract and results): the claim that metrics such as faithfulness or monosemanticity are reported as causal support rests on checking for dedicated sections. The audit does not appear to verify whether assumptions are discussed elsewhere in the papers (methods, discussion, or supplements), so absence of a dedicated section does not demonstrate that assumptions are unstated or unconsidered.

    Authors: We acknowledge that the initial audit emphasized the absence of dedicated identification-assumptions sections, which directly relates to the proposed norm. However, to more rigorously support the substitution pattern, we will re-examine the full text (including methods, results, discussion, and supplements) of the audited papers for any explicit discussion of identification assumptions, causal identification strategies, or how validation metrics support causal claims. The revised audit will report whether such discussions appear outside dedicated sections and adjust the characterization of current practices accordingly. This will provide a more complete assessment of whether assumptions are merely unsectioned or genuinely unstated.

    Revision: yes
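The inter-coder agreement details mentioned in the first response are often summarized with a chance-corrected statistic. The sketch below computes Cohen's kappa for two coders. It is an assumption that a statistic of this kind is what the authors would report, since the text here does not name one, and the labels are made up for illustration rather than taken from the audit.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if each coder labeled independently at their own rates.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels for illustration only; not the paper's audit data.
coder_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
coder_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))  # 0.467
```

In this toy case the coders agree on 6 of 8 items (0.75), but about 0.53 of that agreement is expected by chance, so kappa is well below the raw agreement rate. Reporting both numbers would let readers judge how sensitive the counts are to coding rules.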

Circularity Check

0 steps flagged

Position paper's central claim rests on external audit of other papers with no self-referential derivation

Full rationale

The paper is a position piece whose load-bearing step is an observational audit of 10 external papers (plus a secondary n=30 coding exercise) documenting the absence of dedicated identification-assumptions sections and the substitution of validation metrics for causal identification. No equations, fitted parameters, or derivations appear; the proposal of a disclosure norm follows directly from the external pattern observed rather than reducing to any quantity defined by the authors' own modeling choices. No self-citations are invoked as load-bearing support for the uniqueness or necessity of the claim. This is the standard case of a self-contained argument against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The argument applies standard causal-inference requirements to interpretability literature without introducing new parameters, axioms beyond domain assumptions, or invented entities.

axioms (1)
  • Domain assumption: Causal claims require explicit identification assumptions to be validly supported by data or models.
    This premise is invoked throughout the abstract as the basis for criticizing current practices and proposing the disclosure norm.

