Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

Fengming Liu; Zezheng Lin

arXiv: 2605.08012 · v1 · submitted 2026-05-08 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-11 02:39 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.CL

keywords: mechanistic interpretability, causal claims, identification assumptions, audit, disclosure norm, validation metrics, neural network internals

The pith

Causal claims about neural network internals require explicit identification assumptions in mechanistic interpretability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When papers in mechanistic interpretability describe specific model components as causing certain behaviors, those descriptions depend on assumptions that turn their measurements into evidence of causation. An audit of ten papers spanning four common methods found that none provided a dedicated section stating these assumptions. Instead, metrics such as faithfulness, completeness, monosemanticity, and ablation effects were presented as sufficient to support the causal claims. The authors recommend a disclosure practice in which researchers must state whether their claim is causal, name the identification strategy used, list the assumptions, highlight at least one key assumption, and explain how the conclusions would change if any assumption does not hold. This approach separates the role of validation from the requirements of causal identification.

Core claim

The central claim is that causal vocabulary in mechanistic interpretability, including references to circuits, mediators, and causal abstraction, necessitates the disclosure of identification assumptions. A purposive audit across ten papers in four methodological strands shows the absence of any dedicated identification-assumptions section, with validation metrics routinely substituted for explicit causal identification. The proposed norm requires papers to declare the claim as causal, identify the strategy, enumerate assumptions, stress at least one, and discuss sensitivity to assumption failure.
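The five-part norm can be read as a checklist. The sketch below is one hypothetical way to encode it, not a format the paper prescribes: the class name `IdentificationDisclosure`, its field names, and the example assumptions are all illustrative choices made here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IdentificationDisclosure:
    """Hypothetical record of the five-part disclosure norm (all names illustrative)."""
    is_causal: bool                # 1. is the claim causal?
    strategy: str                  # 2. named identification strategy
    assumptions: List[str]         # 3. enumerated assumptions
    stressed_assumption: str = ""  # 4. the assumption that was stressed
    sensitivity_note: str = ""     # 5. how conclusions shift if assumptions fail

    def missing_items(self) -> List[str]:
        """Return the norm items still missing for a causal claim."""
        if not self.is_causal:
            return []  # a non-causal claim needs no identification disclosure
        missing = []
        if not self.strategy.strip():
            missing.append("identification strategy")
        if not self.assumptions:
            missing.append("enumerated assumptions")
        if self.stressed_assumption not in self.assumptions:
            missing.append("stressed assumption")
        if not self.sensitivity_note.strip():
            missing.append("sensitivity discussion")
        return missing


# Made-up example: a causal claim that omits the sensitivity discussion.
example = IdentificationDisclosure(
    is_causal=True,
    strategy="activation patching on a clean/corrupted prompt pair",
    assumptions=[
        "no compensation by other components",
        "patched activations stay on-distribution",
    ],
    stressed_assumption="no compensation by other components",
    sensitivity_note="",
)
print(example.missing_items())  # ['sensitivity discussion']
```

A record like this only checks that each item is present; whether the listed assumptions are the right ones remains a matter for the reader and reviewers.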

What carries the argument

Identification assumptions that link validation metrics to causal conclusions about model structures.
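To see why a metric needs an assumption before it supports a causal reading, consider a toy case that is not taken from the paper: a hypothetical model with two redundant components. Ablating either one alone leaves the output unchanged, so an ablation effect of zero shows "no causal role" only if one also assumes that no other component compensates.

```python
def toy_model(a_active: bool, b_active: bool) -> int:
    """Hypothetical model: outputs 1 if either redundant component fires."""
    return int(a_active or b_active)


def ablation_effect(component: str) -> int:
    """Drop in output when one component is switched off and the other is left on."""
    baseline = toy_model(True, True)
    if component == "a":
        ablated = toy_model(False, True)
    else:
        ablated = toy_model(True, False)
    return baseline - ablated


def sole_effect(component: str) -> int:
    """Change in output from switching on only this component, the other held off."""
    off = toy_model(False, False)
    on = toy_model(component == "a", component == "b")
    return on - off


# Each pair is (ablation effect, effect when acting alone).
effects = {c: (ablation_effect(c), sole_effect(c)) for c in ("a", "b")}
print(effects)  # {'a': (0, 1), 'b': (0, 1)}
```

Both components score zero on ablation yet each is sufficient for the behavior on its own. The measurement is the same in either reading; only the stated assumption about compensation decides which causal conclusion it licenses.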

If this is right

Causal conclusions in the field would become easier to evaluate for their dependence on unstated premises.
Validation metrics alone would no longer be accepted as direct support for causal relations.
Papers would need to address how results change when key assumptions are violated.
The distinction between correlation captured by metrics and causation would be made explicit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adopting this norm might lead to more cautious interpretations of interpretability findings in downstream applications.
The same requirement could be extended to causal claims in other subfields of machine learning that rely on internal model analysis.

Load-bearing premise

The purposive selection of ten papers and the two-coder audit of thirty items are sufficient to demonstrate a recurring pattern of missing identification-assumption disclosure across mechanistic interpretability research.

What would settle it

A larger or random sample of papers that includes several with explicit sections on identification assumptions for causal claims would undermine the reported pattern.
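A rough calculation shows how little a small sample can rule out even in the best case. The sketch below assumes a simple random sample, which the purposive audit is not, so the numbers are an optimistic illustration rather than a property of the paper's data. The function name `zero_count_upper_bound` is a hypothetical helper introduced here.

```python
def zero_count_upper_bound(n: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on a proportion when 0 of n sampled items show a trait.

    Solves (1 - p) ** n = 1 - confidence for p. Valid only under simple random
    sampling, which is an assumption of this sketch.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1 - (1 - confidence) ** (1 / n)


# Zero dedicated identification sections observed in samples of 10 and 30 papers.
for n in (10, 30):
    print(n, round(zero_count_upper_bound(n), 3))
# 10 0.259
# 30 0.095
```

Under these assumptions, finding no dedicated section in 10 randomly drawn papers would still be compatible with roughly a quarter of the field having one, which is why a larger or random sample could settle the question.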

Figures

Figures reproduced from arXiv: 2605.08012 by Fengming Liu, Zezheng Lin.

**Figure 2.** n = 30 two-human-coder audit (human inter-rater reliability). Left: Dim D under strict adjudication, by methodological strand. Sparse autoencoders (6/6) and causal abstraction (3/3) substitute validation for identification at full rate; activation patching shows the lowest rate (2/7), reflecting partial protective effect of the illusion-and-failure-modes literature within that strand. Right: Dim D = Yes oc…
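The caption refers to inter-rater reliability between two human coders. One common agreement statistic for two coders is Cohen's kappa; whether the paper reports this particular statistic is not stated here, so the sketch below is only an illustration. The labels are invented toy data, not the audit's codes.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(coder_a: Sequence[str], coder_b: Sequence[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two coders."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Share of items on which the two coders gave the same label.
    observed = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Agreement expected by chance from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for ten items (NOT the paper's data).
coder_a = ["Yes"] * 6 + ["No"] * 4
coder_b = ["Yes"] * 5 + ["No"] * 5
print(round(cohens_kappa(coder_a, coder_b), 3))  # 0.8
```

Here the coders agree on 9 of 10 items, chance agreement is 0.5, and kappa is 0.8. A count that shifts between strict and lenient coding rules would change the label sequences and therefore the statistic.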

Original abstract

Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper makes a solid methodological point about needing explicit identification assumptions for causal claims in mechanistic interpretability, but its audit evidence is too thin to back broad claims about the field.

The key takeaway is that this position paper correctly identifies a gap: when interpretability work uses causal language, such as circuits as mediators or causal abstraction, the reported metrics (faithfulness, ablation effects, monosemanticity) do not identify those effects unless the underlying assumptions are stated. The five-part disclosure norm the authors propose (declare whether the claim is causal, name the strategy, list assumptions, stress-test one, and discuss sensitivity) is a concrete and new suggestion for this subfield. It cleanly separates validation from identification, a useful distinction that many papers blur. That part is worth paying attention to if you work in this area.

The audit of 10 papers plus the n=30 check shows the pattern in the sampled cases, and the abstract-only review limitation noted in the stress test does not make the examples any less plausible. The soft spot is exactly what the stress test flags: purposive selection of 10 papers does not support field-wide generalization, and checking for a dedicated section is not the same as checking whether assumptions appear anywhere in the text or supplements. The coding sensitivity of the n=30 part further limits how much weight the evidence can carry. This is not a fatal flaw for a position piece, but it means the paper illustrates a concern rather than demonstrating prevalence.

The work is for researchers in mechanistic interpretability who make causal statements and want clearer standards. It shows clear thinking on the causal inference side and honest engagement with the literature it cites. I would send it to peer review so the community can discuss the proposed norm, though the audit section would likely need reframing as illustrative.

Referee Report

2 major / 1 minor

Summary. The paper claims that mechanistic interpretability research increasingly uses causal vocabulary (circuits, mediators, causal abstraction) but fails to disclose the identification assumptions needed to support such claims. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions sections and a pattern of substituting validation metrics (faithfulness, completeness, monosemanticity, alignment, ablation effects) for causal evidence. A two-coder audit on n=30 reproduces the directional finding, though counts are coding-rule sensitive. The authors propose a five-part disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions change if assumptions fail.

Significance. If the pattern documented in the audit generalizes, the paper's call for explicit identification disclosure could meaningfully improve rigor in mechanistic interpretability by clarifying the boundary between correlational and causal claims. The constructive, actionable disclosure norm is a particular strength that provides a concrete template the community could adopt without requiring new methods.

major comments (2)

[Audit description] Audit of 10 papers (abstract and main audit section): the purposive selection of only 10 papers across four strands, combined with the coding-sensitive nature of the n=30 check, is insufficient to establish a field-wide pattern. Without explicit inclusion/exclusion criteria and full details on how papers were chosen, the evidence cannot rule out selection bias, weakening support for the general recommendation.
[Audit results] Validation-metric substitution pattern (abstract and results): the claim that metrics such as faithfulness or monosemanticity are reported as causal support rests on checking for dedicated sections. The audit does not appear to verify whether assumptions are discussed elsewhere in the papers (methods, discussion, or supplements), so absence of a dedicated section does not demonstrate that assumptions are unstated or unconsidered.

minor comments (1)

[Proposed norm] The five-part disclosure norm is clearly stated but would be strengthened by a short worked example showing how it applies to a concrete claim (e.g., a circuit or mediator finding).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback highlights important methodological considerations for the audit that supports our position. We address each major comment below and outline specific revisions to improve transparency and rigor without altering the core argument for a disclosure norm.

Referee: Audit of 10 papers (abstract and main audit section): the purposive selection of only 10 papers across four strands, combined with the coding-sensitive nature of the n=30 check, is insufficient to establish a field-wide pattern. Without explicit inclusion/exclusion criteria and full details on how papers were chosen, the evidence cannot rule out selection bias, weakening support for the general recommendation.

Authors: We agree that the audit is purposive rather than a systematic or representative survey and does not establish field-wide prevalence. Its purpose is to document the issue across prominent examples from four methodological strands, motivating the proposed norm. The n=30 check is a limited robustness check whose counts are indeed sensitive to coding rules, as already noted in the manuscript. To address selection bias concerns, we will revise the manuscript to include: (1) explicit inclusion/exclusion criteria for the 10 papers, (2) a step-by-step description of the selection process, and (3) the full coding protocol and inter-coder agreement details for the n=30 audit in an expanded appendix. These additions will allow readers to evaluate the audit's scope and limitations directly.

Revision: yes

Referee: Validation-metric substitution pattern (abstract and results): the claim that metrics such as faithfulness or monosemanticity are reported as causal support rests on checking for dedicated sections. The audit does not appear to verify whether assumptions are discussed elsewhere in the papers (methods, discussion, or supplements), so absence of a dedicated section does not demonstrate that assumptions are unstated or unconsidered.

Authors: We acknowledge that the initial audit emphasized the absence of dedicated identification-assumptions sections, which directly relates to the proposed norm. However, to more rigorously support the substitution pattern, we will re-examine the full text (including methods, results, discussion, and supplements) of the audited papers for any explicit discussion of identification assumptions, causal identification strategies, or how validation metrics support causal claims. The revised audit will report whether such discussions appear outside dedicated sections and adjust the characterization of current practices accordingly. This will provide a more complete assessment of whether assumptions are merely unsectioned or genuinely unstated.

Revision: yes

Circularity Check

0 steps flagged

Position paper's central claim rests on external audit of other papers with no self-referential derivation

The paper is a position piece whose load-bearing step is an observational audit of 10 external papers (plus a secondary n=30 coding exercise) documenting the absence of dedicated identification-assumptions sections and the substitution of validation metrics for causal identification. No equations, fitted parameters, or derivations appear; the proposal of a disclosure norm follows directly from the external pattern observed rather than reducing to any quantity defined by the authors' own modeling choices. No self-citations are invoked as load-bearing support for the uniqueness or necessity of the claim. This is the standard case of a self-contained argument against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The argument applies standard causal-inference requirements to interpretability literature without introducing new parameters, axioms beyond domain assumptions, or invented entities.

axioms (1)

Domain assumption: Causal claims require explicit identification assumptions to be validly supported by data or models.
This premise is invoked throughout the abstract as the basis for criticizing current practices and proposing the disclosure norm.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1] Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of causal effects using instrumental variables. JASA, 91(434):444-455, 1996.

[2] Trenton Bricken et al. Towards monosemanticity: decomposing language models with dictionary learning. Anthropic Transformer Circuits Thread, 2023.

[3] Marc Canby, Adam Davies, Chirag Rastogi, and Julia Hockenmaier. Measuring the reliability of causal probing methods. NeurIPS Workshop, 2024.

[4] Arthur Conmy et al. Towards automated circuit discovery for mechanistic interpretability. In NeurIPS, 2023.

[5] Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. In NeurIPS, 2024.

[6] Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In CLeaR, 2024.

[7] Trygve Haavelmo. The probability approach in econometrics. Econometrica, 12, 1944.

[8] James J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153-161, 1979.

[9] Paul W. Holland. Statistics and causal inference. JASA, 81(396):945-960, 1986.

[10] Jing Huang, Atticus Geiger, Karel D'Oosterlinck, Zhengxuan Wu, and Christopher Potts. Rigorously assessing natural language explanations of neurons. In BlackboxNLP at EMNLP, 2023.

[11] Guido W. Imbens. Potential outcome and DAG approaches to causality: relevance for empirical practice in economics. JEL, 58(4):1129-1179, 2020.

[12] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159-174, 1977.

[13] Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, and Neel Nanda. Sparse autoencoders do not find canonical units of analysis. arXiv:2502.04878, 2025.

[14] Aleksandar Makelov, Georgina Lange, Atticus Geiger, and Neel Nanda. Is this the subspace you are looking for? An interpretability illusion for subspace activation patching. In ICLR, 2024.

[15] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge UP, 2nd ed., 2009.

[16] Judea Pearl and Dana Mackenzie. The Book of Why. Basic Books, 2018.

[17] Donald B. Rubin. Estimating causal effects of treatments in randomized and non-randomized studies. Journal of Educational Psychology, 66(5):688-701, 1974.

[18] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In NeurIPS, 2023.

[19] Adly Templeton et al. Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet. Anthropic Transformer Circuits Thread, 2024.

[20] Kevin Wang, Anna Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In ICLR, 2023.

[21] Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah D. Goodman. Interpretability at scale: identifying causal mechanisms in Alpaca. In NeurIPS, 2023.
