Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

Alireza Bayat Makou; Iryna Gurevych; Jingcheng Niu; Subhabrata Dutta

arxiv: 2606.06267 · v1 · pith:QGSTI6EJnew · submitted 2026-06-04 · 💻 cs.CL

Pith reviewed 2026-06-28 01:45 UTC · model grok-4.3

keywords: circuit discovery · mechanistic interpretability · phantom specialization · language models · sequence copying · input variation · edge evaluation

The pith

Structurally distinct circuits implement the same computation when input frequency varies but the task is fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether structural differences in discovered circuits necessarily indicate distinct mechanisms by running circuit discovery on a literal sequence copying task across four token-frequency bands in Pythia models. It finds that the resulting circuits appear specialized by band yet prove functionally equivalent: edges from one band transfer to others, a shared core recovers at least 99 percent of performance, and causal interchange interventions show interchangeable internal representations. This pattern of phantom specialization arises because discovery algorithms sample from an equivalence class of valid subgraphs. Standard source-level evaluation hides the many-to-one structure-to-function mapping while edge-level evaluation and cross-band tests expose it.

Core claim

Structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands.

What carries the argument

Phantom specialization: apparent structural differences between circuits that do not correspond to functional differences, exposed by cross-condition edge transfer and edge-level evaluation.

If this is right

Band-specific edges transfer broadly across frequency bands without loss of function.
A shared core across most bands recovers at least 99% of circuit performance.
Causal interchange interventions confirm interchangeable internal representations across bands.
Repeated extractions within one band sample from an equivalence class of valid subgraphs.
Edge-level evaluation reveals the many-to-one mapping while source-level evaluation inflates apparent faithfulness.
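
The sharing claims in this list can be made concrete with a small sketch. It assumes each circuit is stored as a set of (source, destination) edge names; the band labels, the edge names, and the helpers `sharing_counts`, `core_at_threshold`, and `jaccard` are hypothetical illustrations, not the paper's code or data.

```python
from collections import Counter

def sharing_counts(circuits):
    """Count, for every edge, how many band circuits contain it."""
    counts = Counter()
    for edges in circuits.values():
        counts.update(set(edges))
    return counts

def core_at_threshold(circuits, k):
    """Return the edges that appear in at least k of the band circuits."""
    counts = sharing_counts(circuits)
    return {edge for edge, n in counts.items() if n >= k}

def jaccard(a, b):
    """Jaccard overlap between two edge sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy circuits with made-up edge names (illustrative only).
circuits = {
    "band_1": {("a0.h1", "m1"), ("m1", "logits"), ("a0.h2", "m1")},
    "band_2": {("a0.h1", "m1"), ("m1", "logits"), ("a1.h0", "logits")},
    "band_3": {("a0.h1", "m1"), ("m1", "logits")},
}

# Edges shared by every band form the "universal" core in this toy setup.
universal = core_at_threshold(circuits, k=len(circuits))
overlap_12 = jaccard(circuits["band_1"], circuits["band_2"])
```

Lowering `k` gives progressively larger cores. Whether a given core recovers circuit performance still has to be measured by running the model with only those edges active, which this sketch does not do.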

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interpretability work should require functional equivalence tests rather than relying on structural comparisons alone.
Circuit discovery methods may routinely return one of many equivalent subgraphs for a given behavior.
The same task may admit multiple valid circuits whose differences reflect sampling rather than distinct mechanisms.

Load-bearing premise

Varying token frequency while holding the literal sequence copying task fixed isolates the effect of input statistics on circuit structure.
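
As a rough illustration of this premise, the sketch below builds one clean/corrupt pair following the layout described in the Figure 3 caption: a five-token source prefix, a target, ten distraction tokens, and a repeated prefix that the corrupt run replaces with other tokens from the same band. The function name `make_lsc_pair`, the integer token ids, and the choice to sample all tokens without repetition are assumptions made for the example, not details confirmed by the paper.

```python
import random

def make_lsc_pair(band_pool, rng, prefix_len=5, distract_len=10):
    """Build one (clean, corrupt, target) example from a single band's token pool."""
    # Draw distinct tokens: prefix + target + distraction + replacement prefix.
    n_needed = prefix_len + 1 + distract_len + prefix_len
    tokens = rng.sample(band_pool, n_needed)
    source = tokens[:prefix_len]
    target = tokens[prefix_len]
    distract = tokens[prefix_len + 1 : prefix_len + 1 + distract_len]
    replacement = tokens[prefix_len + 1 + distract_len :]
    # Clean: the prefix is repeated, so the model can copy the target.
    clean = source + [target] + distract + source
    # Corrupt: the repeated prefix is swapped for unrelated same-band tokens.
    corrupt = source + [target] + distract + replacement
    return clean, corrupt, target

# Hypothetical pool of token ids standing in for one frequency band.
pool = list(range(1000, 1100))
clean, corrupt, target = make_lsc_pair(pool, random.Random(0))
```

Because only the pool changes between bands, the sequence layout and the copy target stay identical across conditions, which is the sense in which the task is held fixed.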

What would settle it

An interchange intervention between circuits extracted from different frequency bands that fails to preserve task performance would falsify the claim of interchangeable representations.
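
The test described above can be sketched on a toy model. Everything here is an assumption made for illustration: the one-hot "site" activation, the small context side path, and the function `interchange_accuracy` are stand-ins, not the paper's protocol or models. The point is only the mechanics: take the site activation from a source run, patch it into a base run, and check whether the output now follows the source.

```python
import numpy as np

VOCAB, N_CONTEXTS = 8, 4
rng = np.random.default_rng(0)
EMBED = np.eye(VOCAB)  # toy representation of the token to copy at the patched site
CONTEXT = rng.uniform(0.0, 1.0, size=(N_CONTEXTS, VOCAB))  # side path that bypasses the site

def site_activation(token):
    """Activation at the intervention site for a given token."""
    return EMBED[token]

def logits_from(site, context):
    """Toy readout: the site activation dominates, the context adds a small bias."""
    return site @ EMBED.T + 0.1 * CONTEXT[context]

def interchange_accuracy(base_examples, source_examples):
    """Fraction of pairs where the patched base run predicts the source token."""
    hits = 0
    for (_base_tok, base_ctx), (src_tok, _src_ctx) in zip(base_examples, source_examples):
        # Patch: keep the base context, overwrite the site with the source activation.
        patched = logits_from(site_activation(src_tok), base_ctx)
        hits += int(np.argmax(patched) == src_tok)
    return hits / len(base_examples)

base = [(0, 0), (1, 1), (2, 2)]      # (token, context) pairs, made up
source = [(5, 3), (6, 0), (7, 1)]
iia = interchange_accuracy(base, source)
```

In this toy the representation is interchangeable by construction, so the score is perfect. In a real model, a cross-band score far below a within-band control would be the kind of failure the paragraph above describes.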

Figures

Figures reproduced from arXiv: 2606.06267 by Alireza Bayat Makou, Iryna Gurevych, Jingcheng Niu, Subhabrata Dutta.

**Figure 2.** Two paradigms for circuit discovery. (a) Manual discovery follows a five-step workflow (Rai et al., 2024): select a target behavior, define the graph and granularity, localize important components via intervention, interpret their roles, and evaluate the result, iterating between localization and interpretation until a stable working hypothesis emerges (Olsson et al., 2022; Wang et al., 2023; Hanna et al.,…

**Figure 3.** LSC sequence structure and corruption. Top (clean): the model observes a source prefix S1–5 followed by target T, a distraction segment R1–10, and a repetition of the source prefix, then must predict T at the final position. All 16 tokens are sampled from the same frequency band. Bottom (corrupt): the repeated source prefix is replaced with random tokens X1–5 from the same frequency band, destroying the re…

**Figure 4.** The similarity triangle: three axes for comparing circuits, each probed both correlationally (metric…

**Figure 5.** Edge sharing spectrum by model. Each bar shows the fraction of edges appearing in exactly…

**Figure 6.** Circuit accuracy as a function of sharing threshold…

**Figure 7.** Logit lens trajectories: universal core (dashed) vs. full circuit (solid), with the base model (gray)…

**Figure 8.** Performance gap decomposition. The gap between the universal core (edge-level) and full circuit…

**Figure 9.** Simpson’s paradox in cross-perspective correlations. Universal edge fraction and circuit size fraction…

**Figure 10.** Common ways to construct the corrupt run for activation patching. Simpler ablations are…

**Figure 11.** Activation patching and its two main directions.

**Figure 12.** Data construction pipeline. Each stage (blue) transforms the output of the previous step into…

**Figure 13.** Token vocabulary overview. Top left: category distribution; word_en tokens dominate the vocabulary. Top right: log-frequency distributions by category. Bottom left: word_en frequency distribution with percentile markers (p1, p25, p75, p99) defining the core range for band construction. Bottom right: character length distributions by category. Final scheme: eight conditions. The five core bands span the g…

**Figure 14.** Confound profile across frequency bands for word-validated pools.

**Figure 15.** Pareto sweep results for two representative Pythia models on the control band (top: 70m; bottom:…

**Figure 16.** Same-band and cross-band accuracy of EAP-IG circuits as a function of circuit size (as a multiple…

**Figure 17.** Transfer efficiency across all three circuit discovery methods (ACDC, EAP, EAP-IG) at the…

**Figure 18.** Base model top-1 accuracy across frequency bands. Pythia-70m shows a clear frequency gradient;…

**Figure 19.** Circuit size (edge fraction) across models and frequency bands. Within-model variation across…

**Figure 20.** Base model, circuit, and ablation accuracy across models and bands. Ablation accuracy is at or…

**Figure 21.** Cross-band transfer matrices for all five models (averaged over draws). Rows = training band,…

**Figure 22.** Asymmetric transfer: low-frequency (LF) circuits evaluated on high-frequency (HF) data consis…

**Figure 23.** Circuit properties as a function of model size. Larger models yield more faithful (higher accuracy,…

**Figure 24.** Circuit accuracy versus circuit size (edge fraction) for all 75 circuits. Larger models achieve higher…

**Figure 25.** Control circuit matches the accuracy and transfer of frequency-specific circuits.

**Figure 26.** Head participation rate by model and band.

**Figure 27.** Universal vs. band-specific edge counts per model.

**Figure 28.** Band affinity heatmaps per model…

**Figure 29.** Embedding-layer representational metrics by model size.

**Figure 30.** Residual-stream representational trajectories across layers for all five Pythia models.

**Figure 31.** Logit lens convergence layer by model and frequency band.

**Figure 32.** Attention entropy by layer and model. … metric (Olsson et al., 2022) measures attention to the token following a bigram repetition, whereas LSC uses a five-token prefix before the repeated segment begins. The longer prefix means that the two-token pattern-matching heuristic does not fire, even though the heads perform the same underlying copy computation. BOS-sink heads dominate (35–63%); induction heads ar…

**Figure 33.** Fraction of frequency-selective MLP neurons by layer and model.

**Figure 34.** Information-theoretic layer-wise analyses across all five Pythia models.

**Figure 35.** Layer-wise representational metrics across all five Pythia models. (a) Logit lens convergence…

**Figure 36.** Interchange intervention logit difference by layer (residual stream, prediction position). The…

**Figure 37.** Full 5×5 IIA matrices at peak layer. Rows: base band; columns: source band. Near-uniform values indicate that the model uses a shared representational format across all bands. Position and component analysis. At the peak layer, sweeping patching position across all token positions shows that IIA is zero everywhere except the final prediction position (position 21), confirming that band-distinguishing inf…

**Figure 38.** Positive control: source IIA across layers for within-band different-target interchange patching.

**Figure 39.** Boundless DAS effective subspace dimension across models. Left: absolute dimension; right:…

**Figure 40.** Quantitative similarity triangle. Mean absolute Spearman correlations computed over all metric…

**Figure 41.** Hierarchically clustered correlation matrix of 26 unified metrics. Block structure aligns with…

**Figure 42.** Scaling consistency across perspectives. Six representative metrics (two per perspective) plotted…

**Figure 43.** Universal core accuracy across models and frequency bands. Retention decreases with model…

**Figure 44.** Universal core accuracy vs. size-matched random edge sets. The universal core advantage is…

**Figure 45.** Cross-band accuracy boost from band-specific edges. Each panel shows one model; rows are…

**Figure 46.** Accuracy as a function of sharing threshold…

**Figure 47.** Critical sharing threshold per model and band. Most configurations reach 95% recovery at…

**Figure 48.** Edge sharing distribution by model. Smaller models have a higher fraction of universal (5-band)…

**Figure 49.** Attention head role distribution by universality class.

**Figure 50.** Necessity and sufficiency matrix. The universal core is both sufficient and necessary (complement…

**Figure 51.** Logit lens trajectory overlay: universal core vs. full circuit. The universal core follows the same…

**Figure 52.** Gap decomposition across sharing tiers. Including…

**Figure 53.** Cross-band accuracy under resample (top) vs. zero ablation (bottom). The same-band advantage…

Original abstract

Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by varying input statistics while holding the task fixed, and show that the resulting structural differences exhibit apparent specialization but do not correspond to functional differences, a pattern we term phantom specialization. Using Literal Sequence Copying across four token-frequency bands plus a control condition in five Pythia models (70M-1.4B), we extract 75 circuits and find that structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands. Repeated extractions within the same frequency band further suggest that discovery algorithms sample from an equivalence class of valid subgraphs rather than recovering a unique mechanism. Standard evaluation practice obscures this pattern: source-level evaluation inflates apparent faithfulness, while edge-level evaluation reveals the many-to-one mapping from structure to function. Our results show that structural differences between circuits are not sufficient evidence for distinct mechanisms, and that exposing this requires edge-level evaluation and cross-condition transfer tests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Structurally different circuits for the same copying task turn out to be functionally interchangeable across input frequency bands, backed by transfer and interchange tests.

This work shows that circuit discovery can yield structurally distinct subgraphs that implement the same function when input statistics change but the task stays fixed. The authors call this phantom specialization and support it with three converging checks on Pythia models performing literal sequence copying.

They extract 75 circuits across four frequency bands plus a control condition, then test cross-band edge transfer, recover a shared core that reaches at least 99% of per-band performance, and run causal interchange interventions showing that representations swap without loss. Repeated extractions within a band also vary, suggesting the method samples from an equivalence class rather than recovering a unique mechanism. Edge-level evaluation is what makes the many-to-one mapping visible; source-level evaluation masks it.
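
To see why granularity could matter, consider the following sketch. It rests on an assumed reading of the two terms, which this page does not define: edge-level evaluation keeps only the edges listed in the circuit, while source-level evaluation keeps every outgoing edge of any node that appears as a source in the circuit. The graph, node names, and helper functions are hypothetical.

```python
def edge_level_kept(circuit_edges, all_edges):
    """Edges left active when only the circuit's own edges are kept."""
    return set(circuit_edges) & set(all_edges)

def source_level_kept(circuit_edges, all_edges):
    """Edges left active when every outgoing edge of a circuit source node is kept.

    This definition is an assumption made for illustration.
    """
    sources = {src for src, _dst in circuit_edges}
    return {(src, dst) for src, dst in all_edges if src in sources}

# Toy computational graph and a two-edge circuit (made-up names).
all_edges = {("h0", "m1"), ("h0", "logits"), ("h1", "m1"), ("m1", "logits")}
circuit = {("h0", "m1"), ("m1", "logits")}

edge_kept = edge_level_kept(circuit, all_edges)
source_kept = source_level_kept(circuit, all_edges)
extra = source_kept - edge_kept  # edges that source-level keeps beyond the circuit
```

Under this reading, the source-level set is always a superset of the edge-level set, so it can borrow performance from edges the circuit never claimed. That would be one way for apparent faithfulness to be inflated.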

This is new because prior work treated structural differences as direct evidence of distinct mechanisms without these transfer and interchange controls. The design isolates input distribution reasonably well by holding the copying task constant. The experiments are empirical and avoid circularity.

The main limitation is the narrow task: literal copying is simple and may not capture how circuits behave on more open-ended language work, so generalization remains an open question. The abstract gives clear numbers, but full statistical details and raw data would help in judging robustness. The results may also be sensitive to model scale and to how the frequency bands are defined.

This is for mechanistic interpretability researchers who extract and interpret circuits. Readers who assume one circuit equals one mechanism will find the transfer results useful. It deserves peer review because the tests directly address a common assumption with falsifiable interventions rather than just reinterpretation.

Referee Report

2 major / 3 minor

Summary. The paper claims that circuit discovery methods yield structurally distinct subgraphs for the same task (literal sequence copying) when input token-frequency statistics are varied across four bands plus control, but these differences reflect 'phantom specialization' rather than distinct mechanisms. Using 75 circuits extracted from five Pythia models (70M–1.4B), the authors show that band-specific edges transfer broadly, a shared core across most bands recovers ≥99% of per-band circuit performance, and causal interchange interventions demonstrate that internal representations are interchangeable across bands. Repeated within-band extractions suggest discovery algorithms sample from an equivalence class of valid subgraphs. The work argues that source-level evaluation inflates faithfulness while edge-level evaluation reveals the many-to-one structure-to-function mapping, implying structural differences alone are insufficient evidence for distinct mechanisms.

Significance. If the empirical results hold, the paper makes a substantive contribution to mechanistic interpretability by providing converging evidence (cross-band transfer, shared-core recovery, and interchange interventions) that challenges the common assumption that structural variation in discovered circuits implies functional specialization. The clean isolation of input statistics while holding the task fixed, combined with the emphasis on evaluation granularity, offers a practical methodological caution and supports the view that circuits may belong to equivalence classes rather than unique mechanisms.

major comments (2)

[§4.2] §4.2 (cross-band transfer results): the reported broad transfer of band-specific edges is central to the phantom-specialization claim, yet the manuscript does not report the exact threshold used to classify an edge as 'transferring' or the statistical test for whether transfer rates differ significantly from within-band baselines.
[§5] §5 (interchange intervention protocol): the claim that representations are interchangeable rests on the interchange tests recovering performance; however, the description does not specify how the source and target activations are aligned when the circuits differ in edge sets, which is load-bearing for interpreting the results as evidence of functional equivalence.
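
One way the comparison requested in the first major comment could be run is a paired sign-flip permutation test on within-band versus cross-band accuracy. This is a sketch of a generic test, not the procedure the authors used or promised; the function name `paired_sign_flip_test` and the accuracy values are placeholders invented for the example.

```python
import numpy as np

def paired_sign_flip_test(within, cross, n_perm=10000, seed=0):
    """Two-sided paired permutation test on the mean of (within - cross)."""
    diffs = np.asarray(within, dtype=float) - np.asarray(cross, dtype=float)
    observed = diffs.mean()
    rng = np.random.default_rng(seed)
    # Under the null, each paired difference is equally likely to have either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    # Add-one correction keeps the p-value strictly positive.
    extreme = np.sum(np.abs(null) >= abs(observed) - 1e-12)
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value

# Placeholder accuracies for ten circuit/band pairs (not results from the paper).
within_band = [0.97, 0.95, 0.98, 0.96, 0.99, 0.94, 0.97, 0.96, 0.98, 0.95]
cross_band = [0.96, 0.95, 0.97, 0.96, 0.98, 0.95, 0.96, 0.96, 0.97, 0.95]
mean_gap, p = paired_sign_flip_test(within_band, cross_band)
```

A paired design fits here because each cross-band score has a natural within-band counterpart from the same circuit. A separate equivalence margin would still be needed to argue that transfer is "broad" rather than merely not significantly worse.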

minor comments (3)

[§3.1] The abstract and §3.1 refer to '75 circuits' but do not state how many independent extractions were performed per band-model pair; this detail would clarify the within-band variability analysis.
[Figure 3] Figure 3 (edge-level vs. source-level faithfulness) would benefit from error bars or per-model scatter to show consistency of the inflation effect across the five model sizes.
[§3] Notation for the four frequency bands is introduced in §3 but the exact token-frequency cutoffs are only given in an appendix; moving the definition to the main text would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the positive assessment and constructive comments. We address each major comment below and will incorporate the requested clarifications into the revised manuscript.

Referee: [§4.2] §4.2 (cross-band transfer results): the reported broad transfer of band-specific edges is central to the phantom-specialization claim, yet the manuscript does not report the exact threshold used to classify an edge as 'transferring' or the statistical test for whether transfer rates differ significantly from within-band baselines.

Authors: We agree that these details are necessary for reproducibility. We will revise §4.2 to explicitly report the threshold used to classify an edge as transferring and to include the statistical test (with results) comparing cross-band transfer rates to within-band baselines. revision: yes

Referee: [§5] §5 (interchange intervention protocol): the claim that representations are interchangeable rests on the interchange tests recovering performance; however, the description does not specify how the source and target activations are aligned when the circuits differ in edge sets, which is load-bearing for interpreting the results as evidence of functional equivalence.

Authors: We thank the referee for highlighting this omission. We will revise §5 to specify the alignment procedure for activations (mapping by layer and component indices in the computational graph) when source and target circuits have differing edge sets. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical study that extracts circuits via standard methods, measures cross-band transfer, shared-core performance, and interchange interventions on fixed-task inputs with varied statistics. No derivation, equation, or first-principles claim is present that could reduce to fitted parameters, self-definitions, or self-citation chains. All load-bearing evidence consists of direct experimental measurements (edge transfer rates, faithfulness scores >=99%, interchange success) that are falsifiable outside any internal fit. The design isolates input distribution while holding task fixed, and results are reported via standard evaluation metrics without renaming or smuggling ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper is empirical and rests on standard assumptions from the mechanistic interpretability literature; it introduces a descriptive term but no new mathematical entities or fitted parameters.

axioms (1)

domain assumption: Circuit discovery methods identify subgraphs that explain specific model behaviors.
Foundational premise of the field that the paper tests by varying inputs.

invented entities (1)

phantom specialization (no independent evidence)
purpose: Descriptive label for the pattern where structural differences do not correspond to functional differences.
New term introduced to name the observed phenomenon; no independent evidence provided beyond the experiments.

Reference graph

Works this paper leans on

113 extracted references · 31 canonical work pages · 2 internal anchors

[1]

An argument for divine providence, taken from the constant regularity observ'd in the births of both sexes

II. An argument for divine providence, taken from the constant regularity observ'd in the births of both sexes. By Dr. John Arbuthnott, Physitian in Ordinary to Her Majesty, and Fellow of the College of Physitians and the Royal Society , author =. 1710 , journal =. doi:10.1098/rstl.1710.0011 , url =

work page doi:10.1098/rstl.1710.0011
[2]

2025 , booktitle =

On Mechanistic Circuits for Extractive Question-Answering , author =. 2025 , booktitle =

2025
[3]

2023 , url =

Eliciting Latent Predictions from Transformers with the Tuned Lens , author =. 2023 , url =. 2303.08112 , archiveprefix =

Pith/arXiv arXiv 2023
[4]

1995 , journal =

Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing , author =. 1995 , journal =

1995
[5]

2024 , booktitle =

Finding Transformer Circuits With Edge Pruning , author =. 2024 , booktitle =. doi:10.52202/079017-0587 , url =

work page doi:10.52202/079017-0587 2024
[6]

Stella Biderman and Hailey Schoelkopf and Quentin Gregory Anthony and Herbie Bradley and Kyle O'Brien and Eric Hallahan and Mohammad Aflah Khan and Shivanshu Purohit and USVSN Sai Prashanth and Edward Raff and Aviya Skowron and Lintang Sutawika and Oskar van der Wal , year =. Pythia:. International Conference on Machine Learning,
[7]

Tolga Bolukbasi and Adam Pearce and Ann Yuan and Andy Coenen and Emily Reif and Fernanda B. Vi. An Interpretability Illusion for. 2021 , journal =. 2104.07143 , timestamp =

arXiv 2021
[8]

2024 , booktitle =

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability , author =. 2024 , booktitle =

2024
[9]

2022 , journal =

Causal scrubbing, a method for rigorously testing interpretability hypotheses , author =. 2022 , journal =

2022
[10]

2023 , booktitle =

A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations , author =. 2023 , booktitle =

2023
[11]

2013 , publisher =

Statistical Power Analysis for the Behavioral Sciences , author =. 2013 , publisher =

2013
[12]

2025 , booktitle =

Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning , author =. 2025 , booktitle =. doi:10.18653/v1/2025.findings-naacl.283 , url =

work page doi:10.18653/v1/2025.findings-naacl.283 2025
[13]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , year =. Proceedings of the 2019 Conference of the North. doi:10.18653/v1/N19-1423 , url =

work page doi:10.18653/v1/n19-1423 2019
[14]

Transcoders find interpretable

Dunefsky, Jacob and Chlenski, Philippe and Nanda, Neel , year =. Transcoders find interpretable. Advances in Neural Information Processing Systems , volume =. doi:10.52202/079017-0768 , url =

work page doi:10.52202/079017-0768
[15]

2024 , journal =

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning , author =. 2024 , journal =

2024
[16]

2001 , journal =

Degeneracy and complexity in biological systems , author =. 2001 , journal =. doi:10.1073/pnas.231499798 , url =. https://www.pnas.org/doi/pdf/10.1073/pnas.231499798 , abstract =

work page doi:10.1073/pnas.231499798 2001
[17]

2024 , booktitle =

The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains , author =. 2024 , booktitle =

2024
[18]

1987 , journal =

Better Bootstrap Confidence Intervals , author =. 1987 , journal =

1987
[19]

2021 , journal =

A Mathematical Framework for Transformer Circuits , author =. 2021 , journal =

2021
[20]

2022 , journal =

Toy Models of Superposition , author =. 2022 , journal =

2022
[21]

2024 , booktitle =

On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task , author =. 2024 , booktitle =. doi:10.18653/v1/2024.findings-emnlp.591 , url =

work page doi:10.18653/v1/2024.findings-emnlp.591 2024
[22]

doi: 10.18653/v1/2021.acl-long.144

Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models , author =. 2021 , booktitle =. doi:10.18653/v1/2021.acl-long.144 , url =

work page doi:10.18653/v1/2021.acl-long.144 2021
[23]

1922 , journal =

On the Interpretation of ^2 from Contingency Tables, and the Calculation of P , author =. 1922 , journal =

1922
[24]

1966 , publisher =

The Design of Experiments , author =. 1966 , publisher =

1966
[25]

2026 , url =

Finding Interpretable Prompt-Specific Circuits in Language Models , author =. 2026 , url =. 2602.13483 , archiveprefix =

Pith/arXiv arXiv 2026
[26]

2021 , journal =

The Pile: An 800GB Dataset of Diverse Text for Language Modeling , author =. 2021 , journal =. 2101.00027 , timestamp =

Pith/arXiv arXiv 2021
[27]

How does

Jorge Garc. How does. 2024 , booktitle =

2024
[28]

2024 , journal =

Adversarial Circuit Evaluation , author =. 2024 , journal =. doi:10.48550/ARXIV.2407.15166 , url =. 2407.15166 , timestamp =

work page doi:10.48550/arxiv.2407.15166 2024
[29]

Goodman and Christopher Potts and Thomas Icard , year =

Atticus Geiger and Duligur Ibeling and Amir Zur and Maheep Chaudhary and Sonakshi Chauhan and Jing Huang and Aryaman Arora and Zhengxuan Wu and Noah D. Goodman and Christopher Potts and Thomas Icard , year =. Causal Abstraction:. J. Mach. Learn. Res. , volume =
[30]

2024 , booktitle =

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations , author =. 2024 , booktitle =

2024
[31]

Localizing Model Behavior with Path Patching

Localizing Model Behavior with Path Patching , author =. 2023 , journal =. doi:10.48550/ARXIV.2304.05969 , url =. 2304.05969 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.05969 2023
[32]

2018 , booktitle =

FRAGE: Frequency-Agnostic Word Representation , author =. 2018 , booktitle =

2018
[33]

Gould, S. J. and Lewontin, R. C. , year =. The spandrels of. Proceedings of the Royal Society of London. B. Biological Sciences , volume =. doi:10.1098/rspb.1979.0086 , url =

work page doi:10.1098/rspb.1979.0086 1979
[34]

Wang, Ben and Komatsuzaki, Aran , year =
[35]

GPT - N eo X -20 B : An Open-Source Autoregressive Language Model

Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel , year =. Proceedings of BigScience E...

work page doi:10.18653/v1/2022.bigscience-1.9 2022
[36]

Proceedings of the 62nd

Groeneveld, Dirk and Beltagy, Iz and Walsh, Evan and Bhagia, Akshita and Kinney, Rodney and Tafjord, Oyvind and Jha, Ananya and Ivison, Hamish and Magnusson, Ian and Wang, Yizhong and Arora, Shane and Atkinson, David and Authur, Russell and Chandu, Khyathi and Cohan, Arman and Dumas, Jennifer and Elazar, Yanai and Gu, Yuling and Hessel, Jack and Khot, Tus...

work page doi:10.18653/v1/2024.acl-long.841 2024
[37]

2025 , booktitle =

Position-aware Automatic Circuit Discovery , author =. 2025 , booktitle =. doi:10.18653/v1/2025.acl-long.141 , url =

work page doi:10.18653/v1/2025.acl-long.141 2025
[38]

How does

Michael Hanna and Ollie Liu and Alexandre Variengien , year =. How does. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 , url =

2023
[39]

Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. 2024.
[40]

Heimersheim, Stefan and Janiak, Jett. A circuit for
[41]

How to use and interpret activation patching. 2024. arXiv:2404.15255. doi:10.48550/arxiv.2404.15255
[42]

Quantifying causal emergence shows that macro can beat micro. 2013. doi:10.1073/pnas.1314922110. https://www.pnas.org/doi/pdf/10.1073/pnas.1314922110
[43]

Successor Heads: Recurring, Interpretable Attention Heads In The Wild. 2024.
[44]

Jaccard, Paul. Bulletin de la Soci
[45]

A Distribution-Free k-Sample Test Against Ordered Alternatives. 1954.
[46]

Similarity of Neural Network Representations Revisited. 2019.
[47]

AtP*: An efficient and scalable method for localizing LLM behaviour to components. 2024. arXiv:2403.00745. doi:10.48550/arxiv.2403.00745
[48]

Use of Ranks in One-Criterion Variance Analysis. 1952.
[49]

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models. 2024. doi:10.18653/v1/2024.emnlp-main.699
[50]

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. 2023. arXiv:2307.09458
[51]

Tracr: Compiled Transformers as a Laboratory for Interpretability. 2023.
[52]

Distributed Specialization: Rare-Token Neurons in Large Language Models. 2025. arXiv:2509.21163
[53]

Repetitions are not all alike: distinct mechanisms sustain repetition in language models. 2025. arXiv:2504.01100
[54]

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. 2024.
[55]

On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. 1947.
[56]

The Detection of Disease Clustering and a Generalized Regression Approach. 1967.
[57]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. 2025.
[58]

Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads. 2024. doi:10.18653/v1/2024.blackboxnlp-1.22
[59]

The Hydra Effect: Emergent Self-repair in Language Model Computations. 2023. arXiv:2307.15771
[60]

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? 2025.
[61]

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis. 2025. arXiv:2510.00845
[62]

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan. Locating and Editing Factual Associations in. Advances in Neural Information Processing Systems.
[63]

Circuit Component Reuse Across Tasks in Transformer Language Models. 2024.
[64]

On Linear Representations and Pretraining Data Frequency in Language Models. 2025.
[65]

Transformer Circuit Evaluation Metrics Are Not Robust. 2024.
[66]

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models. 2025. doi:10.18653/v1/2025.acl-long.727
[67]

Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv. 2025.
[68]

Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability. 2024. arXiv:2411.16105
[69]

Neel Nanda and Joseph Bloom.
[70]

Progress measures for grokking via mechanistic interpretability. 2023.
[71]

Towards Automated Circuit Discovery for Mechanistic Interpretability. 2023.
[72]

Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics. 2025.
[73]

Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning. 2025.
[74]

A theory of biological relativity: no privileged level of causation. 2011. doi:10.1098/rsfs.2011.0067
[75]

nostalgebraist. Interpreting
[76]

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models. 2024. arXiv:2405.12522
[77]

Zoom In: An Introduction to Circuits. 2020. doi:10.23915/distill.00024.001
[78]

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. 2022.
[79]

In-context Learning and Induction Heads. 2022.
[80]

Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals. 2024. doi:10.18653/v1/2024.acl-long.458
