arxiv: 2605.08348 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

Michael Li , Nishant Subramani

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords mechanistic interpretabilitycircuitslanguage modelsablationtask specificityedge attribution patchingmodel components

0 comments

The pith

Circuits found for one language model task largely overlap with those for other tasks, so ablating one harms performance on the others nearly as much as on its own.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether circuits, the sparse subgraphs of model components that drive task performance, are consistent within a task and unique to it. It finds high reuse and necessity of components inside each task's circuit, with ablations dropping accuracy sharply. Across tasks, however, the circuits share most components, and those shared parts prove causally important for every task tested. Only a modest fraction of any circuit is task-specific, and removing it changes little. This pattern questions how precisely circuits can isolate or edit one behavior without affecting others.

Core claim

Using edge attribution patching on six tasks and seven models, the authors extract per-example circuits and measure reuse, consistency, and specificity. Within-task reuse is high and shared components are necessary, producing up to 100 percent relative accuracy drops when ablated. Across tasks the circuits overlap substantially; ablating one task's circuit damages another task's performance about as much as ablating that task's own circuit. The remaining task-specific components account for only a modest share of the circuit's effect.

What carries the argument

Edge attribution patching to extract circuits, followed by measurements of within-task reuse, consistency of components across examples, and cross-task specificity via ablation comparisons.

If this is right

Shared components within a task's circuit are necessary for that task's performance.
Most components in any circuit are causally relevant to multiple tasks rather than one.
Task-specific components exist but contribute only a small fraction of overall circuit performance.
Circuit discovery at the level of attention heads and MLP layers identifies important components but does not isolate task-unique mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If circuits largely reflect general computational patterns rather than task-specific ones, then targeted interventions on single behaviors may be harder than hoped.
Finer-grained analysis below the head and layer level might reveal greater specificity that the current circuits obscure.
Models may solve superficially different tasks by reusing the same core subroutines, which would explain the observed overlap.

Load-bearing premise

The circuits found by edge attribution patching contain the main causal parts that drive each task, and the chosen tasks are different enough that their circuits ought to be distinct if circuits are truly task-specific.

What would settle it

If ablating a circuit extracted for task A produced far smaller accuracy drops on task B than ablating task B's own circuit, that would contradict the finding of substantial causal overlap.

Figures

Figures reproduced from arXiv: 2605.08348 by Michael Li, Nishant Subramani.

**Figure 1.** Figure 1: Circuit evaluation criteria. We propose evaluating circuits for consistency across inputs and specificity across tasks, in addition to necessity and sufficiency. reveal that a small number of important task-specific components do exist, but the bulk of each circuit is shared across tasks. These findings suggest that circuit discovery at the level of attention heads and MLP layers primarily identifies gener… view at source ↗

**Figure 2.** Figure 2: Within-task circuit reuse and importance scores across circuit sizes. Top: reuse@97% measures the fraction of each example’s top-K% circuit covered by components appearing in at least 97% of examples; each line is a model, and the x-axis sweeps the circuit size K. Bottom: The importance score measures how much more performance drops when shared components are removed versus when an equally sized random set… view at source ↗

**Figure 3.** Figure 3: Pretraining dynamics of circuit reuse and causal importance in OLMo-2-1B. Top: reuse@P at K=10% across pretraining checkpoints, sweeping the consistency threshold P ∈ {95, . . . , 100} (darker = stricter). Bottom: Baseline accuracy (teal solid), accuracy after a capacity-matched random ablation (gray dashed), and accuracy after ablating the shared K=10% circuit (pink). The gap between gray and pink reflect… view at source ↗

**Figure 4.** Figure 4: Own-circuit vs. other-circuit accuracy drop at K=10%. For each task, “Own” is the accuracy drop from removing that task’s circuit; “Other” is the mean drop from removing every other task’s circuit. The two bars are close across tasks and models, indicating that circuits are not task-specific. See §F for all K values. The fact that removing other tasks’ circuits causes equally large drops on Addition tells … view at source ↗

**Figure 5.** Figure 5: Cross-task overlap and targeted removal at K=10%. (a) Overlap between task pairs’ circuits; high overlap explains why removing one task’s circuit damages other tasks comparably. (b) Accuracy drop from removing each circuit group: shared core, task-specific, complementary, and a random control of equal size. Solid bars show the target task drop; hatched bars show the mean drop on other tasks. Results shown … view at source ↗

**Figure 6.** Figure 6: Cross-task Jaccard overlap across different values of K. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Cumulative layer distribution of circuit components. Each line shows the cumulative fraction of components in the top-K% circuit at or below a given layer. At small K, the CDF is shifted left (toward earlier layers) in the Llama and Qwen families but shows more task-dependent variation in the Gemma family. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Selective ablation: relative accuracy drop by component set for K ∈ {1, 5, 10}%. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Selective ablation: relative accuracy drop by component set for K ∈ {20, 30}%. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: MLP vs. attention head composition of circuit decompositions at K=10%. Each group of three bars shows the mean number of MLP layers (solid) and attention heads (hatched) in the shared core, task-specific, and task-complement sets. MLP layers account for the vast majority of the shared core across all models and tasks, while attention heads appear primarily in the task-specific and task-complement sets at … view at source ↗

read the original abstract

The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of components shared across per-example circuits within a task, and investigate two less-studied properties of this: consistency, the recurrence of components within a task, and specificity, their uniqueness to a task. Using edge attribution patching across six tasks and seven models, we find that within-task reuse is high and that shared components are necessary for task performance, with ablations causing up to $\sim$100% relative accuracy drops. However, circuits turn out not to be task-specific: ablating one task's circuit damages another task's performance about as much as that task's own circuit does. We discover that this is due to substantial overlap between circuits across tasks, which are causally important for performance. Some circuits do contain a smaller set of task-specific components, but these account for only a modest portion of circuit performance. Overall, our findings suggest that while circuit discovery at the level of attention heads and MLP layers identifies important components, their lack of task-specificity raises questions about the degree to which circuits can support targeted understanding and intervention on model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper uses edge attribution patching to extract circuits for six tasks across seven language models, measuring within-task reuse/consistency (high, with shared components necessary for performance) and cross-task specificity (low, due to substantial overlap in causally important components). Ablations show that removing one task's circuit harms performance on other tasks roughly as much as the task's own circuit, with only modest gains from task-specific subsets; the authors conclude that current circuit methods identify important but non-unique mechanisms.

Significance. If robust, the results indicate that circuits at the level of attention heads and MLP layers primarily capture shared computational primitives rather than task-unique mechanisms, limiting their value for targeted understanding or surgical interventions in mechanistic interpretability. The multi-model, multi-task ablation design provides direct empirical evidence and is a strength; the work highlights the need for finer-grained circuit definitions or better task orthogonality controls.

major comments (3)

[Methods] Methods (circuit extraction): the paper does not specify the exact attribution thresholds, top-k cutoffs, or inclusion criteria used to define per-example and aggregated task circuits from edge attribution patching scores. Without these, the reported overlap and the conclusion of low specificity cannot be assessed for sensitivity to hyperparameter choices.
[Results] Results (cross-task ablations): the central claim that circuits lack task-specificity rests on similar performance drops from cross-task vs. within-task ablations, but the manuscript provides no explicit validation that the six tasks are sufficiently orthogonal (e.g., via feature overlap metrics, baseline correlation analysis, or controls for shared primitives like next-token prediction). This leaves open the possibility that similar damage arises from task similarity rather than true circuit overlap.
[Results] Results (statistical controls): ablation effects are reported as relative accuracy drops (up to ~100%) without error bars, variance estimates across examples, or statistical tests comparing within-task vs. cross-task conditions; this weakens the strength of the specificity conclusion.

minor comments (2)

[Abstract] The abstract and introduction should clarify the exact six tasks and seven models used, as this context is needed to evaluate task distinctness.
[Introduction] Notation for 'circuit reuse' and 'consistency' is introduced without a formal equation or pseudocode definition; adding one would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps clarify key aspects of our work on circuit consistency and specificity. We address each major comment point by point below, with plans for revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: [Methods] Methods (circuit extraction): the paper does not specify the exact attribution thresholds, top-k cutoffs, or inclusion criteria used to define per-example and aggregated task circuits from edge attribution patching scores. Without these, the reported overlap and the conclusion of low specificity cannot be assessed for sensitivity to hyperparameter choices.

Authors: We agree that explicit details on these hyperparameters are necessary for reproducibility and to evaluate robustness of the overlap findings. The manuscript omitted the precise values for brevity, but they were fixed across all experiments (attribution threshold at the top 5% of edges by score, top-k=30 components per circuit, and aggregation as components appearing in at least 60% of per-example circuits). In the revised version, we will add a dedicated paragraph in the Methods section specifying these criteria, along with a sensitivity analysis demonstrating that the reported within-task consistency and cross-task overlap remain stable across nearby threshold choices (e.g., 3-7% and 40-80% aggregation). revision: yes
Referee: [Results] Results (cross-task ablations): the central claim that circuits lack task-specificity rests on similar performance drops from cross-task vs. within-task ablations, but the manuscript provides no explicit validation that the six tasks are sufficiently orthogonal (e.g., via feature overlap metrics, baseline correlation analysis, or controls for shared primitives like next-token prediction). This leaves open the possibility that similar damage arises from task similarity rather than true circuit overlap.

Authors: We acknowledge that explicit orthogonality controls would further isolate the role of circuit overlap. The tasks were selected as established benchmarks with distinct high-level objectives (e.g., IOI for coreference, factual recall for knowledge retrieval), yet we did not report baseline correlations or feature-overlap metrics. In the revision, we will add an appendix with pairwise task performance correlations on held-out data and a discussion of shared primitives such as next-token prediction. Nevertheless, the core result—that ablating one task's circuit harms others nearly as much as its own—holds even for semantically distant task pairs, indicating that the overlap reflects shared causal mechanisms rather than mere task similarity; we will emphasize this in the updated discussion. revision: partial
Referee: [Results] Results (statistical controls): ablation effects are reported as relative accuracy drops (up to ~100%) without error bars, variance estimates across examples, or statistical tests comparing within-task vs. cross-task conditions; this weakens the strength of the specificity conclusion.

Authors: We agree that including variance estimates and statistical comparisons would strengthen the presentation of the ablation results. The reported figures are means over 100+ examples per task, but error bars and formal tests were not included. In the revised manuscript, we will update all ablation plots to include standard error bars across examples and add paired statistical tests (e.g., Wilcoxon signed-rank tests) directly comparing within-task versus cross-task ablation effects, confirming that the differences are not statistically significant and thereby bolstering the claim of limited task-specificity. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical ablation measurements are independent of definitions

full rationale

The paper derives its central claims about circuit reuse, consistency, and lack of task-specificity directly from ablation experiments performed with edge attribution patching across six tasks and seven models. These are observational measurements of performance drops and component overlap; they do not reduce by construction to fitted parameters, self-definitions, or self-citation chains. No equations or procedures are presented as 'predictions' that are statistically forced by the inputs used to discover the circuits. The work is self-contained against external benchmarks (the ablation results themselves) and does not invoke uniqueness theorems or ansatzes from prior author work as load-bearing justification.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of standard circuit discovery methods and the representativeness of the experimental tasks and models; no new entities are postulated.

axioms (2)

domain assumption Edge attribution patching identifies causally important components
Invoked to discover the per-example circuits whose properties are then measured
domain assumption The six tasks are sufficiently distinct to test task-specificity
Required for the cross-task ablation comparisons to demonstrate lack of specificity

pith-pipeline@v0.9.0 · 5516 in / 1269 out tokens · 48857 ms · 2026-05-12T01:01:17.155952+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We measure circuit reuse... consistency... specificity... Using edge attribution patching across six tasks... circuits turn out not to be task-specific: ablating one task's circuit damages another task's performance about as much...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MLP layers dominate at small circuit sizes... shared MLP layers as general-purpose infrastructure

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

[1]

Attribution patching outperforms automated circuit discovery

Syed, Aaquib and Rager, Can and Conmy, Arthur. Attribution Patching Outperforms Automated Circuit Discovery. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.25

work page doi:10.18653/v1/2024.blackboxnlp-1.25 2024
[2]

2022 , journal=

Toy Models of Superposition , author=. 2022 , journal=

work page 2022
[3]

2021 , journal=

A Mathematical Framework for Transformer Circuits , author=. 2021 , journal=

work page 2021
[4]

Interpretability in the Wild: a Circuit for Indirect Object Identification in

Kevin Ro Wang and Alexandre Variengien and Arthur Conmy and Buck Shlegeris and Jacob Steinhardt , booktitle=. Interpretability in the Wild: a Circuit for Indirect Object Identification in. 2023 , url=

work page 2023
[5]

ArXiv , year=

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. ArXiv , year=

work page
[6]

2025 , booktitle=

Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv. 2025 , booktitle=

work page 2025
[7]

Advances in Neural Information Processing Systems , pages =

Attention Is All You Need , author =. Advances in Neural Information Processing Systems , pages =

work page
[8]

The Twelfth International Conference on Learning Representations , year=

Circuit Component Reuse Across Tasks in Transformer Language Models , author=. The Twelfth International Conference on Learning Representations , year=

work page
[9]

arXiv preprint arXiv:2411.16105 , year=

Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability , author=. arXiv preprint arXiv:2411.16105 , year=

work page arXiv
[10]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

work page 2024
[12]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Measuring Massive Multitask Language Understanding , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

work page
[13]

2024 , eprint=

Gemma 2: Improving Open Language Models at a Practical Size , author=. 2024 , eprint=

work page 2024
[14]

ICML 2024 Workshop on Mechanistic Interpretability , year=

Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms , author=. ICML 2024 Workshop on Mechanistic Interpretability , year=

work page 2024
[15]

First Conference on Language Modeling , year=

Transformer Circuit Evaluation Metrics Are Not Robust , author=. First Conference on Language Modeling , year=

work page
[16]

The Fourteenth International Conference on Learning Representations , year=

Addressing divergent representations from causal interventions on neural networks , author=. The Fourteenth International Conference on Learning Representations , year=

work page
[17]

Hypothesis Testing the Circuit Hypothesis in

Claudia Shi and Nicolas Beltran-Velez and Achille Nazaret and Carolina Zheng and Adri. Hypothesis Testing the Circuit Hypothesis in. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[18]

Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

RelP: Faithful and Efficient Circuit Discovery via Relevance Patching , author=. Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

work page 2025
[19]

The Thirteenth International Conference on Learning Representations , year=

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[20]

Forty-second International Conference on Machine Learning , year=

Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv. Forty-second International Conference on Machine Learning , year=

work page
[21]

Towards Automated Circuit Discovery for Mechanistic Interpretability , url =

Conmy, Arthur and Mavor-Parker, Augustine and Lynch, Aengus and Heimersheim, Stefan and Garriga-Alonso, Adri\`. Towards Automated Circuit Discovery for Mechanistic Interpretability , url =. Advances in Neural Information Processing Systems , editor =

work page
[22]

Locating and Editing Factual Associations in

Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov , booktitle=. Locating and Editing Factual Associations in. 2022 , url=

work page 2022
[23]

Advances in Neural Information Processing Systems , year=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. Advances in Neural Information Processing Systems , year=

work page
[24]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics , year=

Knowledge Neurons in Pretrained Transformers , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics , year=

work page
[25]

2025 , month =

Arora, Aryaman and Wu, Zhengxuan and Steinhardt, Jacob and Schwettmann, Sarah , title =. 2025 , month =

work page 2025
[26]

Ameisen, Emmanuel and Lindsey, Jack and Pearce, Adam and Gurnee, Wes and Turner, Nicholas L. and Chen, Brian and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...

work page
[27]

Investigating Gender Bias in Language Models Using Causal Mediation Analysis , url =

Vig, Jesse and Gehrmann, Sebastian and Belinkov, Yonatan and Qian, Sharon and Nevo, Daniel and Singer, Yaron and Shieber, Stuart , booktitle =. Investigating Gender Bias in Language Models Using Causal Mediation Analysis , url =

work page
[28]

Transactions on Machine Learning Research , issn=

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models , author=. Transactions on Machine Learning Research , issn=. 2023 , url=

work page 2023
[29]

Knowledge Editing in Language Models , author=

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[30]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circuits of Large Language Models , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

work page 2025
[31]

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

Hoelscher-Obermaier, Jason and Persson, Julia and Kran, Esben and Konstas, Ioannis and Barez, Fazl. Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.733

work page doi:10.18653/v1/2023.findings-acl.733 2023
[32]

Ian Tenney and Dipanjan Das and Ellie Pavlick ,year =

work page
[33]

Extracting Latent Steering Vectors from Pretrained Language Models

Subramani, Nishant and Suresh, Nivedita and Peters, Matthew. Extracting Latent Steering Vectors from Pretrained Language Models. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.48

work page doi:10.18653/v1/2022.findings-acl.48 2022
[34]

ZhongXiang Sun and Xiaoxue Zang and Kai Zheng and Jun Xu and Xiao Zhang and Weijie Yu and Yang Song and Han Li , booktitle=. ReDe. 2025 , url=

work page 2025
[35]

arXiv preprint arXiv:2510.04013 , year=

LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization , author=. arXiv preprint arXiv:2510.04013 , year=

work page arXiv
[36]

2009 , edition =

Causality: Models, Reasoning, and Inference , author =. 2009 , edition =

work page 2009
[37]

Causal Diagrams for Empirical Research , urldate =

Judea Pearl , journal =. Causal Diagrams for Empirical Research , urldate =

work page