Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims
Pith reviewed 2026-05-11 02:39 UTC · model grok-4.3
The pith
Causal claims about neural network internals require explicit identification assumptions in mechanistic interpretability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that causal vocabulary in mechanistic interpretability, including references to circuits, mediators, and causal abstraction, necessitates the disclosure of identification assumptions. A purposive audit across ten papers in four methodological strands shows the absence of any dedicated identification-assumptions section, with validation metrics routinely substituted for explicit causal identification. The proposed norm requires papers to declare the claim as causal, identify the strategy, enumerate assumptions, stress at least one, and discuss sensitivity to assumption failure.
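The five-part norm reads naturally as a checklist. The sketch below is one hypothetical way to encode it, not something the paper provides: the class name `IdentificationDisclosure`, its field names, and the helper `missing_items` are all invented here for illustration, and the example claim is made up rather than taken from any audited paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IdentificationDisclosure:
    """Hypothetical record of the five disclosure items in the proposed norm."""
    is_causal_claim: Optional[bool] = None                 # 1. is the claim declared causal?
    strategy: Optional[str] = None                         # 2. named identification strategy
    assumptions: List[str] = field(default_factory=list)   # 3. enumerated assumptions
    stressed_assumptions: List[str] = field(default_factory=list)  # 4. assumptions that were stressed
    sensitivity_note: Optional[str] = None                 # 5. how conclusions shift if assumptions fail


def missing_items(d: IdentificationDisclosure) -> List[str]:
    """Return the norm items that the disclosure leaves unaddressed."""
    missing: List[str] = []
    if d.is_causal_claim is None:
        missing.append("causal declaration")
    # Assumption of this sketch: a claim explicitly declared non-causal
    # does not owe the remaining four items.
    if d.is_causal_claim is False:
        return missing
    if not d.strategy:
        missing.append("identification strategy")
    if not d.assumptions:
        missing.append("enumerated assumptions")
    # A stressed assumption only counts if it is one of the enumerated ones.
    if not any(a in d.assumptions for a in d.stressed_assumptions):
        missing.append("stressed assumption")
    if not d.sensitivity_note:
        missing.append("sensitivity discussion")
    return missing


# Invented example: a circuit claim that names a strategy but nothing else.
draft = IdentificationDisclosure(is_causal_claim=True, strategy="activation patching")
print(missing_items(draft))
```

For the invented `draft` above, the helper reports three gaps: enumerated assumptions, a stressed assumption, and a sensitivity discussion. This mirrors the audited pattern in which a validation procedure is reported without the assumptions that would make it identifying.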
What carries the argument
Identification assumptions that link validation metrics to causal conclusions about model structures.
If this is right
- Causal conclusions in the field would become easier to evaluate for their dependence on unstated premises.
- Validation metrics alone would no longer be accepted as direct support for causal relations.
- Papers would need to address how results change when key assumptions are violated.
- The distinction between correlation captured by metrics and causation would be made explicit.
Where Pith is reading between the lines
- Adopting this norm might lead to more cautious interpretations of interpretability findings in downstream applications.
- The same requirement could be extended to causal claims in other subfields of machine learning that rely on internal model analysis.
Load-bearing premise
The purposive selection of ten papers and the two-coder audit of thirty items are sufficient to demonstrate a recurring pattern of missing identification-assumption disclosure across mechanistic interpretability research.
What would settle it
A larger or random sample of papers that includes several with explicit sections on identification assumptions for causal claims would undermine the reported pattern.
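To see why sample size matters here, consider a counterfactual in which the papers had been drawn at random. The audit is purposive, so this calculation does not apply to it directly. Under that assumption only, observing zero papers with a dedicated section still leaves room for a non-trivial true share. The function name `zero_count_upper_bound` is hypothetical. The formula is the standard exact one-sided binomial bound for zero observed events, obtained by solving (1 - p)^n = 1 - confidence.

```python
def zero_count_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on a proportion when 0 of n randomly
    sampled items show the trait (solves (1 - p)**n = 1 - confidence)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# Sample sizes matching the two audits; the random-sampling premise is an
# assumption of this sketch, not a property of the purposive audit.
for n in (10, 30):
    print(n, round(zero_count_upper_bound(n), 3))
```

Under this idealized reading, zero hits in 10 papers is compatible with roughly a quarter of the population having a dedicated section, and zero hits in 30 with just under a tenth.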
Original abstract
Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that mechanistic interpretability research increasingly uses causal vocabulary (circuits, mediators, causal abstraction) but fails to disclose the identification assumptions needed to support such claims. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions sections and a pattern of substituting validation metrics (faithfulness, completeness, monosemanticity, alignment, ablation effects) for causal evidence. A two-coder audit on n=30 reproduces the directional finding, though counts are coding-rule sensitive. The authors propose a five-part disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions change if assumptions fail.
Significance. If the pattern documented in the audit generalizes, the paper's call for explicit identification disclosure could meaningfully improve rigor in mechanistic interpretability by clarifying the boundary between correlational and causal claims. The constructive, actionable disclosure norm is a particular strength that provides a concrete template the community could adopt without requiring new methods.
Major comments (2)
- [Audit description] Audit of 10 papers (abstract and main audit section): the purposive selection of only 10 papers across four strands, combined with the coding-sensitive nature of the n=30 check, is insufficient to establish a field-wide pattern. Without explicit inclusion/exclusion criteria and full details on how papers were chosen, the evidence cannot rule out selection bias, weakening support for the general recommendation.
- [Audit results] Validation-metric substitution pattern (abstract and results): the claim that metrics such as faithfulness or monosemanticity are reported as causal support rests on checking for dedicated sections. The audit does not appear to verify whether assumptions are discussed elsewhere in the papers (methods, discussion, or supplements), so absence of a dedicated section does not demonstrate that assumptions are unstated or unconsidered.
Minor comments (1)
- [Proposed norm] The five-part disclosure norm is clearly stated but would be strengthened by a short worked example showing how it applies to a concrete claim (e.g., a circuit or mediator finding).
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. The feedback highlights important methodological considerations for the audit that supports our position. We address each major comment below and outline specific revisions to improve transparency and rigor without altering the core argument for a disclosure norm.
- Referee: Audit of 10 papers (abstract and main audit section): the purposive selection of only 10 papers across four strands, combined with the coding-sensitive nature of the n=30 check, is insufficient to establish a field-wide pattern. Without explicit inclusion/exclusion criteria and full details on how papers were chosen, the evidence cannot rule out selection bias, weakening support for the general recommendation.
  Authors: We agree that the audit is purposive rather than a systematic or representative survey and does not establish field-wide prevalence. Its purpose is to document the issue across prominent examples from four methodological strands, motivating the proposed norm. The n=30 check is a limited robustness check whose counts are indeed sensitive to coding rules, as already noted in the manuscript. To address selection bias concerns, we will revise the manuscript to include: (1) explicit inclusion/exclusion criteria for the 10 papers, (2) a step-by-step description of the selection process, and (3) the full coding protocol and inter-coder agreement details for the n=30 audit in an expanded appendix. These additions will allow readers to evaluate the audit's scope and limitations directly.
  Revision: yes
- Referee: Validation-metric substitution pattern (abstract and results): the claim that metrics such as faithfulness or monosemanticity are reported as causal support rests on checking for dedicated sections. The audit does not appear to verify whether assumptions are discussed elsewhere in the papers (methods, discussion, or supplements), so absence of a dedicated section does not demonstrate that assumptions are unstated or unconsidered.
  Authors: We acknowledge that the initial audit emphasized the absence of dedicated identification-assumptions sections, which directly relates to the proposed norm. However, to more rigorously support the substitution pattern, we will re-examine the full text (including methods, results, discussion, and supplements) of the audited papers for any explicit discussion of identification assumptions, causal identification strategies, or how validation metrics support causal claims. The revised audit will report whether such discussions appear outside dedicated sections and adjust the characterization of current practices accordingly. This will provide a more complete assessment of whether assumptions are merely unsectioned or genuinely unstated.
  Revision: yes
Circularity Check
Position paper's central claim rests on external audit of other papers with no self-referential derivation
The paper is a position piece whose load-bearing step is an observational audit of 10 external papers (plus a secondary n=30 coding exercise) documenting the absence of dedicated identification-assumptions sections and the substitution of validation metrics for causal identification. No equations, fitted parameters, or derivations appear; the proposal of a disclosure norm follows directly from the external pattern observed rather than reducing to any quantity defined by the authors' own modeling choices. No self-citations are invoked as load-bearing support for the uniqueness or necessity of the claim. This is the standard case of a self-contained argument against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Causal claims require explicit identification assumptions to be validly supported by data or models.