arxiv: 2606.06267 · v1 · pith:QGSTI6EJnew · submitted 2026-06-04 · 💻 cs.CL

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

Pith reviewed 2026-06-28 01:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords: circuit discovery, mechanistic interpretability, phantom specialization, language models, sequence copying, input variation, edge evaluation

The pith

Structurally distinct circuits implement the same computation when input token frequency varies but the task is held fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether structural differences in discovered circuits necessarily indicate distinct mechanisms by running circuit discovery on a literal sequence copying task across four token-frequency bands in Pythia models. It finds that the resulting circuits appear specialized by band yet prove functionally equivalent: edges from one band transfer to others, a shared core recovers at least 99 percent of performance, and causal interchange interventions show interchangeable internal representations. This pattern of phantom specialization arises because discovery algorithms sample from an equivalence class of valid subgraphs. Standard source-level evaluation hides the many-to-one structure-to-function mapping while edge-level evaluation and cross-band tests expose it.

Core claim

Structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands.

What carries the argument

Phantom specialization: apparent structural differences between circuits that do not correspond to functional differences, exposed by cross-condition edge transfer and edge-level evaluation.
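
A minimal Python sketch of the edge-sharing bookkeeping this argument relies on, assuming each circuit is stored as a set of (source, destination) edges keyed by band. The band names, node labels, and helper names (`sharing_counts`, `shared_core`, `jaccard`) are hypothetical and the toy circuits are invented for illustration; none of them come from the paper.

```python
from collections import Counter


def sharing_counts(circuits):
    """Map each edge to the number of bands whose circuit contains it."""
    counts = Counter()
    for edges in circuits.values():
        counts.update(set(edges))
    return counts


def shared_core(circuits, min_bands):
    """Return the edges present in at least `min_bands` per-band circuits."""
    counts = sharing_counts(circuits)
    return {edge for edge, n in counts.items() if n >= min_bands}


def jaccard(a, b):
    """Structural overlap between two edge sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical toy circuits (assumption: three bands, made-up node names).
circuits = {
    "band_1": {("emb", "a0.h1"), ("a0.h1", "a1.h3"), ("a1.h3", "logits"), ("emb", "m0")},
    "band_2": {("emb", "a0.h1"), ("a0.h1", "a1.h3"), ("a1.h3", "logits"), ("m0", "logits")},
    "band_3": {("emb", "a0.h1"), ("a0.h1", "a1.h3"), ("a1.h3", "logits")},
}

core = shared_core(circuits, min_bands=3)
# Sharing spectrum: how many edges appear in exactly k bands.
spectrum = Counter(sharing_counts(circuits).values())
```

The sketch only covers the structural side. Whether a core built this way recovers circuit performance, or whether band-specific edges transfer, has to be measured on the model itself.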

If this is right

  • Band-specific edges transfer broadly across frequency bands without loss of function.
  • A shared core across most bands recovers at least 99% of circuit performance.
  • Causal interchange interventions confirm interchangeable internal representations across bands.
  • Repeated extractions within one band sample from an equivalence class of valid subgraphs.
  • Edge-level evaluation reveals the many-to-one mapping while source-level evaluation inflates apparent faithfulness.
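
The granularity point in the last bullet can be sketched as follows. This is one plausible reading and an assumption, not the paper's definition: edge-level evaluation keeps exactly the listed edges, while source-level evaluation keeps every edge leaving any source node the circuit uses. The graph, node names, and function names are hypothetical.

```python
def edge_level_kept(all_edges, circuit_edges):
    """Edge-level: keep exactly the edges listed in the circuit."""
    return set(all_edges) & set(circuit_edges)


def source_level_kept(all_edges, circuit_edges):
    """Source-level (assumed reading): keep every edge whose source node
    appears as a source somewhere in the circuit."""
    sources = {src for src, _ in circuit_edges}
    return {(src, dst) for src, dst in all_edges if src in sources}


# Hypothetical four-node computational graph.
all_edges = {
    ("emb", "h0"), ("emb", "h1"), ("emb", "out"),
    ("h0", "h1"), ("h0", "out"),
    ("h1", "out"),
}
circuit = {("emb", "h0"), ("h0", "out")}

kept_edge = edge_level_kept(all_edges, circuit)
kept_source = source_level_kept(all_edges, circuit)
```

Under this reading the source-level set is always a superset of the edge-level set, which is one way a coarser evaluation could report higher faithfulness for the same nominal circuit.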

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interpretability work should require functional equivalence tests rather than relying on structural comparisons alone.
  • Circuit discovery methods may routinely return one of many equivalent subgraphs for a given behavior.
  • The same task may admit multiple valid circuits whose differences reflect sampling rather than distinct mechanisms.

Load-bearing premise

Varying token frequency while holding the literal sequence copying task fixed isolates the effect of input statistics on circuit structure.
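
A sketch of how one clean/corrupt pair for the copying task could be built, following the layout given in the Figure 3 caption: a five-token source prefix, the target, a ten-token distraction segment, and the repeated prefix, with the corrupt run replacing the repeated prefix by random tokens from the same band. Sampling without replacement, the integer token ids, and the function name `make_lsc_pair` are assumptions made for illustration.

```python
import random


def make_lsc_pair(band_pool, rng, prefix_len=5, distract_len=10):
    """Build (clean, corrupt, target) token lists from one frequency band.

    Assumption: all tokens are drawn without replacement from `band_pool`.
    """
    n_unique = prefix_len + 1 + distract_len  # 16 with the defaults
    tokens = rng.sample(band_pool, n_unique + prefix_len)
    prefix = tokens[:prefix_len]
    target = tokens[prefix_len]
    distract = tokens[prefix_len + 1:n_unique]
    corrupt_prefix = tokens[n_unique:]  # random same-band replacements
    clean = prefix + [target] + distract + prefix
    corrupt = prefix + [target] + distract + corrupt_prefix
    return clean, corrupt, target


# Hypothetical band: 100 token ids standing in for one frequency band.
rng = random.Random(0)
pool = list(range(1000, 1100))
clean, corrupt, target = make_lsc_pair(pool, rng)
```

Holding this template fixed and swapping only `band_pool` is what keeps the task constant while the input statistics change.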

What would settle it

An interchange intervention between circuits extracted from different frequency bands that fails to preserve task performance would falsify the claim of interchangeable representations.
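
A toy illustration of the interchange protocol and of what a falsifying outcome would look like. Everything here is hypothetical: the two "models" are hand-written functions, not Pythia, and the band-specific code (an offset of 7 per band) is invented to show a case where cross-band patching fails.

```python
PREFIX_LEN = 5


def retrieve(seq):
    """Return the token that followed the earlier copy of the final prefix."""
    prefix = seq[-PREFIX_LEN:]
    for i in range(len(seq) - PREFIX_LEN):
        if seq[i:i + PREFIX_LEN] == prefix:
            return seq[i + PREFIX_LEN]
    raise ValueError("no earlier copy of the prefix")


def band_of(token):
    return token // 1000  # assumption: token ids encode the band


# Model with a shared representation: the hidden state is the target itself.
shared = (lambda seq: retrieve(seq), lambda hidden, seq: hidden)

# Model with a band-specific code that the decoder undoes using the base band.
banded = (
    lambda seq: retrieve(seq) + 7 * band_of(retrieve(seq)),
    lambda hidden, seq: hidden - 7 * band_of(seq[0]),
)


def run(model, seq, patch=None):
    """Run the two-stage model, optionally overwriting the hidden state."""
    encode, decode = model
    hidden = encode(seq) if patch is None else patch
    return decode(hidden, seq), hidden


def make_seq(band, offset):
    toks = [band * 1000 + offset + j for j in range(16)]
    return toks[:5] + [toks[5]] + toks[6:] + toks[:5]


def iia(model, pairs):
    """Fraction of (base, source) pairs where patching the source hidden
    state into the base run yields the source's target."""
    hits = 0
    for base, source in pairs:
        _, cached = run(model, source)
        out, _ = run(model, base, patch=cached)
        hits += out == retrieve(source)
    return hits / len(pairs)


# Cross-band pairs: base from band 1, source from band 2.
pairs = [(make_seq(1, k), make_seq(2, 3 * k)) for k in range(1, 6)]
```

Both toy models solve the task without patching; only the one with a shared hidden format survives the cross-band swap, which is the distinction the intervention is meant to detect.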

Figures

Figures reproduced from arXiv: 2606.06267 by Alireza Bayat Makou, Iryna Gurevych, Jingcheng Niu, Subhabrata Dutta.

Figure 1: Experimental design and analytical framework.
Figure 2: Two paradigms for circuit discovery. (a) Manual discovery follows a five-step workflow (Rai et al., 2024): select a target behavior, define the graph and granularity, localize important components via intervention, interpret their roles, and evaluate the result, iterating between localization and interpretation until a stable working hypothesis emerges (Olsson et al., 2022; Wang et al., 2023; Hanna et al.,…
Figure 3: LSC sequence structure and corruption. Top (clean): the model observes a source prefix S1–5 followed by target T, a distraction segment R1–10, and a repetition of the source prefix, then must predict T at the final position. All 16 tokens are sampled from the same frequency band. Bottom (corrupt): the repeated source prefix is replaced with random tokens X1–5 from the same frequency band, destroying the re…
Figure 4: The similarity triangle: three axes for comparing circuits, each probed both correlationally (metric …
Figure 5: Edge sharing spectrum by model. Each bar shows the fraction of edges appearing in exactly …
Figure 6: Circuit accuracy as a function of sharing threshold …
Figure 7: Logit lens trajectories: universal core (dashed) vs. full circuit (solid), with the base model (gray) …
Figure 8: Performance gap decomposition. The gap between the universal core (edge-level) and full circuit …
Figure 9: Simpson’s paradox in cross-perspective correlations. Universal edge fraction and circuit size fraction …
Figure 10: Common ways to construct the corrupt run for activation patching. Simpler ablations are …
Figure 11: Activation patching and its two main directions.
Figure 12: Data construction pipeline. Each stage (blue) transforms the output of the previous step into …
Figure 13: Token vocabulary overview. Top left: category distribution; word_en tokens dominate the vocabulary. Top right: log-frequency distributions by category. Bottom left: word_en frequency distribution with percentile markers (p1, p25, p75, p99) defining the core range for band construction. Bottom right: character length distributions by category. Final scheme: eight conditions. The five core bands span the g…
Figure 14: Confound profile across frequency bands for word-validated pools.
Figure 15: Pareto sweep results for two representative Pythia models on the control band (top: 70m; bottom: …
Figure 16: Same-band and cross-band accuracy of EAP-IG circuits as a function of circuit size (as a multiple …
Figure 17: Transfer efficiency across all three circuit discovery methods (ACDC, EAP, EAP-IG) at the …
Figure 18: Base model top-1 accuracy across frequency bands. Pythia-70m shows a clear frequency gradient; …
Figure 19: Circuit size (edge fraction) across models and frequency bands. Within-model variation across …
Figure 20: Base model, circuit, and ablation accuracy across models and bands. Ablation accuracy is at or …
Figure 21: Cross-band transfer matrices for all five models (averaged over draws). Rows = training band, …
Figure 22: Asymmetric transfer: low-frequency (LF) circuits evaluated on high-frequency (HF) data consis…
Figure 23: Circuit properties as a function of model size. Larger models yield more faithful (higher accuracy, …
Figure 24: Circuit accuracy versus circuit size (edge fraction) for all 75 circuits. Larger models achieve higher …
Figure 25: Control circuit matches the accuracy and transfer of frequency-specific circuits.
Figure 26: Head participation rate by model and band.
Figure 27: Universal vs. band-specific edge counts per model.
Figure 28: Band affinity heatmaps per model.
Figure 29: Embedding-layer representational metrics by model size.
Figure 30: Residual-stream representational trajectories across layers for all five Pythia models.
Figure 31: Logit lens convergence layer by model and frequency band.
Figure 32: Attention entropy by layer and model. … metric (Olsson et al., 2022) measures attention to the token following a bigram repetition, whereas LSC uses a five-token prefix before the repeated segment begins. The longer prefix means that the two-token pattern-matching heuristic does not fire, even though the heads perform the same underlying copy computation. BOS-sink heads dominate (35–63%); induction heads ar…
Figure 33: Fraction of frequency-selective MLP neurons by layer and model.
Figure 34: Information-theoretic layer-wise analyses across all five Pythia models.
Figure 35: Layer-wise representational metrics across all five Pythia models. (a) Logit lens convergence …
Figure 36: Interchange intervention logit difference by layer (residual stream, prediction position). The …
Figure 37: Full 5×5 IIA matrices at peak layer. Rows: base band; columns: source band. Near-uniform values indicate that the model uses a shared representational format across all bands. Position and component analysis. At the peak layer, sweeping patching position across all token positions shows that IIA is zero everywhere except the final prediction position (position 21), confirming that band-distinguishing inf…
Figure 38: Positive control: source IIA across layers for within-band different-target interchange patching.
Figure 39: Boundless DAS effective subspace dimension across models. Left: absolute dimension; right: …
Figure 40: Quantitative similarity triangle. Mean absolute Spearman correlations computed over all metric …
Figure 41: Hierarchically clustered correlation matrix of 26 unified metrics. Block structure aligns with …
Figure 42: Scaling consistency across perspectives. Six representative metrics (two per perspective) plotted …
Figure 43: Universal core accuracy across models and frequency bands. Retention decreases with model …
Figure 44: Universal core accuracy vs. size-matched random edge sets. The universal core advantage is …
Figure 45: Cross-band accuracy boost from band-specific edges. Each panel shows one model; rows are …
Figure 46: Accuracy as a function of sharing threshold …
Figure 47: Critical sharing threshold per model and band. Most configurations reach 95% recovery at …
Figure 48: Edge sharing distribution by model. Smaller models have a higher fraction of universal (5-band) …
Figure 49: Attention head role distribution by universality class.
Figure 50: Necessity and sufficiency matrix. The universal core is both sufficient and necessary (complement …
Figure 51: Logit lens trajectory overlay: universal core vs. full circuit. The universal core follows the same …
Figure 52: Gap decomposition across sharing tiers. Including …
Figure 53: Cross-band accuracy under resample (top) vs. zero ablation (bottom). The same-band advantage …
Original abstract

Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by varying input statistics while holding the task fixed, and show that the resulting structural differences exhibit apparent specialization but do not correspond to functional differences, a pattern we term phantom specialization. Using Literal Sequence Copying across four token-frequency bands plus a control condition in five Pythia models (70M-1.4B), we extract 75 circuits and find that structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands. Repeated extractions within the same frequency band further suggest that discovery algorithms sample from an equivalence class of valid subgraphs rather than recovering a unique mechanism. Standard evaluation practice obscures this pattern: source-level evaluation inflates apparent faithfulness, while edge-level evaluation reveals the many-to-one mapping from structure to function. Our results show that structural differences between circuits are not sufficient evidence for distinct mechanisms, and that exposing this requires edge-level evaluation and cross-condition transfer tests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that circuit discovery methods yield structurally distinct subgraphs for the same task (literal sequence copying) when input token-frequency statistics are varied across four bands plus control, but these differences reflect 'phantom specialization' rather than distinct mechanisms. Using 75 circuits extracted from five Pythia models (70M–1.4B), the authors show that band-specific edges transfer broadly, a shared core across most bands recovers ≥99% of per-band circuit performance, and causal interchange interventions demonstrate that internal representations are interchangeable across bands. Repeated within-band extractions suggest discovery algorithms sample from an equivalence class of valid subgraphs. The work argues that source-level evaluation inflates faithfulness while edge-level evaluation reveals the many-to-one structure-to-function mapping, implying structural differences alone are insufficient evidence for distinct mechanisms.

Significance. If the empirical results hold, the paper makes a substantive contribution to mechanistic interpretability by providing converging evidence (cross-band transfer, shared-core recovery, and interchange interventions) that challenges the common assumption that structural variation in discovered circuits implies functional specialization. The clean isolation of input statistics while holding the task fixed, combined with the emphasis on evaluation granularity, offers a practical methodological caution and supports the view that circuits may belong to equivalence classes rather than unique mechanisms.

major comments (2)
  1. §4.2 (cross-band transfer results): the reported broad transfer of band-specific edges is central to the phantom-specialization claim, yet the manuscript does not report the exact threshold used to classify an edge as 'transferring' or the statistical test for whether transfer rates differ significantly from within-band baselines.
  2. §5 (interchange intervention protocol): the claim that representations are interchangeable rests on the interchange tests recovering performance; however, the description does not specify how the source and target activations are aligned when the circuits differ in edge sets, which is load-bearing for interpreting the results as evidence of functional equivalence.
minor comments (3)
  1. §3.1: The abstract and §3.1 refer to '75 circuits' but do not state how many independent extractions were performed per band-model pair; this detail would clarify the within-band variability analysis.
  2. Figure 3 (edge-level vs. source-level faithfulness) would benefit from error bars or per-model scatter to show consistency of the inflation effect across the five model sizes.
  3. §3: Notation for the four frequency bands is introduced in §3 but the exact token-frequency cutoffs are only given in an appendix; moving the definition to the main text would improve readability.
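
One way the statistical comparison requested in major comment 1 could be carried out is a two-sided permutation test on mean accuracy. This is a sketch of a generic test, not the procedure the authors use; the accuracy values are made up and the function name `permutation_test` is hypothetical.

```python
import random
from statistics import fmean


def permutation_test(within, cross, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value with the usual +1 correction.
    """
    rng = random.Random(seed)
    observed = abs(fmean(within) - fmean(cross))
    pooled = list(within) + list(cross)
    n = len(within)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n]) - fmean(pooled[n:]))
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Made-up accuracies for illustration only.
within_band = [0.99, 0.98, 0.99, 0.97, 0.98]
cross_band_similar = [0.98, 0.99, 0.97, 0.98, 0.99]
cross_band_degraded = [0.60, 0.62, 0.58, 0.61, 0.59]

p_similar = permutation_test(within_band, cross_band_similar)
p_degraded = permutation_test(within_band, cross_band_degraded)
```

A large p-value in the first comparison would be consistent with broad transfer, though failing to reject is not proof of equivalence; an equivalence test with a pre-stated margin would make that claim more directly.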

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the positive assessment and constructive comments. We address each major comment below and will incorporate the requested clarifications into the revised manuscript.

  1. Referee: §4.2 (cross-band transfer results): the reported broad transfer of band-specific edges is central to the phantom-specialization claim, yet the manuscript does not report the exact threshold used to classify an edge as 'transferring' or the statistical test for whether transfer rates differ significantly from within-band baselines.

    Authors: We agree that these details are necessary for reproducibility. We will revise §4.2 to explicitly report the threshold used to classify an edge as transferring and to include the statistical test (with results) comparing cross-band transfer rates to within-band baselines. Revision: yes.

  2. Referee: §5 (interchange intervention protocol): the claim that representations are interchangeable rests on the interchange tests recovering performance; however, the description does not specify how the source and target activations are aligned when the circuits differ in edge sets, which is load-bearing for interpreting the results as evidence of functional equivalence.

    Authors: We thank the referee for highlighting this omission. We will revise §5 to specify the alignment procedure for activations (mapping by layer and component indices in the computational graph) when source and target circuits have differing edge sets. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical study that extracts circuits via standard methods and measures cross-band transfer, shared-core performance, and interchange interventions on fixed-task inputs with varied statistics. No derivation, equation, or first-principles claim is present that could reduce to fitted parameters, self-definitions, or self-citation chains. All load-bearing evidence consists of direct experimental measurements (edge transfer rates, faithfulness scores ≥99%, interchange success) that are falsifiable outside any internal fit. The design isolates the input distribution while holding the task fixed, and results are reported via standard evaluation metrics without renaming or smuggling in ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper is empirical and rests on standard assumptions from the mechanistic interpretability literature; it introduces a descriptive term but no new mathematical entities or fitted parameters.

axioms (1)
  • Domain assumption: Circuit discovery methods identify subgraphs that explain specific model behaviors.
    Foundational premise of the field that the paper tests by varying inputs.
invented entities (1)
  • phantom specialization (no independent evidence)
    Purpose: Descriptive label for the pattern where structural differences do not correspond to functional differences.
    New term introduced to name the observed phenomenon; no independent evidence provided beyond the experiments.

Reference graph

Works this paper leans on

113 extracted references · 31 canonical work pages · 2 internal anchors

  1. [1]

    An argument for divine providence, taken from the constant regularity observ'd in the births of both sexes

    II. An argument for divine providence, taken from the constant regularity observ'd in the births of both sexes. By Dr. John Arbuthnott, Physitian in Ordinary to Her Majesty, and Fellow of the College of Physitians and the Royal Society , author =. 1710 , journal =. doi:10.1098/rstl.1710.0011 , url =

  2. [2]

    2025 , booktitle =

    On Mechanistic Circuits for Extractive Question-Answering , author =. 2025 , booktitle =

  3. [3]

    2023 , url =

    Eliciting Latent Predictions from Transformers with the Tuned Lens , author =. 2023 , url =. 2303.08112 , archiveprefix =

  4. [4]

    1995 , journal =

    Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing , author =. 1995 , journal =

  5. [5]

    2024 , booktitle =

    Finding Transformer Circuits With Edge Pruning , author =. 2024 , booktitle =. doi:10.52202/079017-0587 , url =

  6. [6]

    Stella Biderman and Hailey Schoelkopf and Quentin Gregory Anthony and Herbie Bradley and Kyle O'Brien and Eric Hallahan and Mohammad Aflah Khan and Shivanshu Purohit and USVSN Sai Prashanth and Edward Raff and Aviya Skowron and Lintang Sutawika and Oskar van der Wal , year =. Pythia:. International Conference on Machine Learning,

  7. [7]

    Tolga Bolukbasi and Adam Pearce and Ann Yuan and Andy Coenen and Emily Reif and Fernanda B. Vi. An Interpretability Illusion for. 2021 , journal =. 2104.07143 , timestamp =

  8. [8]

    2024 , booktitle =

    Using Degeneracy in the Loss Landscape for Mechanistic Interpretability , author =. 2024 , booktitle =

  9. [9]

    2022 , journal =

    Causal scrubbing, a method for rigorously testing interpretability hypotheses , author =. 2022 , journal =

  10. [10]

    2023 , booktitle =

    A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations , author =. 2023 , booktitle =

  11. [11]

    2013 , publisher =

    Statistical Power Analysis for the Behavioral Sciences , author =. 2013 , publisher =

  12. [12]

    2025 , booktitle =

    Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning , author =. 2025 , booktitle =. doi:10.18653/v1/2025.findings-naacl.283 , url =

  13. [13]

    BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , year =. Proceedings of the 2019 Conference of the North. doi:10.18653/v1/N19-1423 , url =

  14. [14]

    Transcoders find interpretable

    Dunefsky, Jacob and Chlenski, Philippe and Nanda, Neel , year =. Transcoders find interpretable. Advances in Neural Information Processing Systems , volume =. doi:10.52202/079017-0768 , url =

  15. [15]

    2024 , journal =

    How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning , author =. 2024 , journal =

  16. [16]

    2001 , journal =

    Degeneracy and complexity in biological systems , author =. 2001 , journal =. doi:10.1073/pnas.231499798 , url =. https://www.pnas.org/doi/pdf/10.1073/pnas.231499798 , abstract =

  17. [17]

    2024 , booktitle =

    The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains , author =. 2024 , booktitle =

  18. [18]

    1987 , journal =

    Better Bootstrap Confidence Intervals , author =. 1987 , journal =

  19. [19]

    2021 , journal =

    A Mathematical Framework for Transformer Circuits , author =. 2021 , journal =

  20. [20]

    2022 , journal =

    Toy Models of Superposition , author =. 2022 , journal =

  21. [21]

    2024 , booktitle =

    On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task , author =. 2024 , booktitle =. doi:10.18653/v1/2024.findings-emnlp.591 , url =

  22. [22]

    doi: 10.18653/v1/2021.acl-long.144

    Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models , author =. 2021 , booktitle =. doi:10.18653/v1/2021.acl-long.144 , url =

  23. [23]

    1922 , journal =

    On the Interpretation of ^2 from Contingency Tables, and the Calculation of P , author =. 1922 , journal =

  24. [24]

    1966 , publisher =

    The Design of Experiments , author =. 1966 , publisher =

  25. [25]

    2026 , url =

    Finding Interpretable Prompt-Specific Circuits in Language Models , author =. 2026 , url =. 2602.13483 , archiveprefix =

  26. [26]

    2021 , journal =

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling , author =. 2021 , journal =. 2101.00027 , timestamp =

  27. [27]

    How does

    Jorge Garc. How does. 2024 , booktitle =

  28. [28]

    2024 , journal =

    Adversarial Circuit Evaluation , author =. 2024 , journal =. doi:10.48550/ARXIV.2407.15166 , url =. 2407.15166 , timestamp =

  29. [29]

    Goodman and Christopher Potts and Thomas Icard , year =

    Atticus Geiger and Duligur Ibeling and Amir Zur and Maheep Chaudhary and Sonakshi Chauhan and Jing Huang and Aryaman Arora and Zhengxuan Wu and Noah D. Goodman and Christopher Potts and Thomas Icard , year =. Causal Abstraction:. J. Mach. Learn. Res. , volume =

  30. [30]

    2024 , booktitle =

    Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations , author =. 2024 , booktitle =

  31. [31]

    Localizing Model Behavior with Path Patching

    Localizing Model Behavior with Path Patching , author =. 2023 , journal =. doi:10.48550/ARXIV.2304.05969 , url =. 2304.05969 , timestamp =

  32. [32]

    2018 , booktitle =

    FRAGE: Frequency-Agnostic Word Representation , author =. 2018 , booktitle =

  33. [33]

    Gould, S. J. and Lewontin, R. C. , year =. The spandrels of. Proceedings of the Royal Society of London. B. Biological Sciences , volume =. doi:10.1098/rspb.1979.0086 , url =

  34. [34]

    Wang, Ben and Komatsuzaki, Aran , year =

  35. [35]

    GPT - N eo X -20 B : An Open-Source Autoregressive Language Model

    Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel , year =. Proceedings of BigScience E...

  36. [36]

    Proceedings of the 62nd

    Groeneveld, Dirk and Beltagy, Iz and Walsh, Evan and Bhagia, Akshita and Kinney, Rodney and Tafjord, Oyvind and Jha, Ananya and Ivison, Hamish and Magnusson, Ian and Wang, Yizhong and Arora, Shane and Atkinson, David and Authur, Russell and Chandu, Khyathi and Cohan, Arman and Dumas, Jennifer and Elazar, Yanai and Gu, Yuling and Hessel, Jack and Khot, Tus...

  37. [37]

    2025 , booktitle =

    Position-aware Automatic Circuit Discovery , author =. 2025 , booktitle =. doi:10.18653/v1/2025.acl-long.141 , url =

  38. [38]

    How does

    Michael Hanna and Ollie Liu and Alexandre Variengien , year =. How does. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 , url =

  39. [39]

    2024 , booktitle =

    Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms , author =. 2024 , booktitle =

  40. [40]

    A circuit for

    Heimersheim, Stefan and Janiak, Jett , year =. A circuit for

  41. [41]

    How to use and interpret activation patching

    How to use and interpret activation patching , author =. 2024 , journal =. doi:10.48550/ARXIV.2404.15255 , url =. 2404.15255 , timestamp =

  42. [42]

    Quanti- fying causal emergence shows that macro can beat micro

    Quantifying causal emergence shows that macro can beat micro , author =. 2013 , journal =. doi:10.1073/pnas.1314922110 , url =. https://www.pnas.org/doi/pdf/10.1073/pnas.1314922110 , abstract =

  43. [43]

    2024 , booktitle =

    Successor Heads: Recurring, Interpretable Attention Heads In The Wild , author =. 2024 , booktitle =

  44. [44]

    Bulletin de la Soci

    Jaccard, Paul , year =. Bulletin de la Soci

  45. [45]

    1954 , journal =

    A Distribution-Free k-Sample Test Against Ordered Alternatives , author =. 1954 , journal =

  46. [46]

    2019 , booktitle =

    Similarity of Neural Network Representations Revisited , author =. 2019 , booktitle =

  47. [47]

    Atp*: An efficient and scalable method for localizing llm behaviour to components

    J. AtP*: An efficient and scalable method for localizing. 2024 , journal =. doi:10.48550/ARXIV.2403.00745 , url =. 2403.00745 , timestamp =

  48. [48]

    1952 , journal =

    Use of Ranks in One-Criterion Variance Analysis , author =. 1952 , journal =

  49. [49]

    2024 , booktitle =

    Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models , author =. 2024 , booktitle =. doi:10.18653/v1/2024.emnlp-main.699 , url =

  50. [50]

    2023 , url =

    Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla , author =. 2023 , url =. 2307.09458 , archiveprefix =

  51. [51]

    2023 , booktitle =

    Tracr: Compiled Transformers as a Laboratory for Interpretability , author =. 2023 , booktitle =

  52. [52]

    Distributed Specialization: Rare-Token Neurons in Large Language Models. 2025. arXiv:2509.21163

  53. [53]

    Repetitions are not all alike: distinct mechanisms sustain repetition in language models. 2025. arXiv:2504.01100

  54. [54]

    Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. 2024.

  55. [55]

    On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. 1947.

  56. [56]

    The Detection of Disease Clustering and a Generalized Regression Approach. 1967.

  57. [57]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. 2025.

  58. [58]

    Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads. 2024. doi:10.18653/v1/2024.blackboxnlp-1.22

  59. [59]

    The Hydra Effect: Emergent Self-repair in Language Model Computations. 2023. arXiv:2307.15771

  60. [60]

    Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? 2025.

  61. [61]

    Mechanistic Interpretability as Statistical Estimation: A Variance Analysis. 2025. arXiv:2510.00845

  62. [62]

    Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan. Locating and Editing Factual Associations in. Advances in Neural Information Processing Systems.

  63. [63]

    Circuit Component Reuse Across Tasks in Transformer Language Models. 2024.

  64. [64]

    On Linear Representations and Pretraining Data Frequency in Language Models. 2025.

  65. [65]

    Transformer Circuit Evaluation Metrics Are Not Robust. 2024.

  66. [66]

    Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models. 2025. doi:10.18653/v1/2025.acl-long.727

  67. [67]

    Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv. 2025.

  68. [68]

    Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability. 2024. arXiv:2411.16105

  69. [69]

    Neel Nanda and Joseph Bloom.

  70. [70]

    Progress measures for grokking via mechanistic interpretability. 2023.

  71. [71]

    Towards Automated Circuit Discovery for Mechanistic Interpretability. 2023.

  72. [72]

    Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics. 2025.

  73. [73]

    Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning. 2025.

  74. [74]

    A theory of biological relativity: no privileged level of causation. 2011. doi:10.1098/rsfs.2011.0067

  75. [75]

    nostalgebraist. Interpreting

  76. [76]

    Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models. 2024. arXiv:2405.12522

  77. [77]

    Zoom In: An Introduction to Circuits. 2020. doi:10.23915/distill.00024.001

  78. [78]

    Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. 2022.

  79. [79]

    In-context Learning and Induction Heads. 2022.

  80. [80]

    Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals. 2024. doi:10.18653/v1/2024.acl-long.458

Showing first 80 references.