pith. machine review for the scientific record.

arxiv: 2602.13483 · v2 · submitted 2026-02-13 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Finding Interpretable Prompt-Specific Circuits in Language Models

Pith reviewed 2026-05-15 22:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mechanistic interpretability · attention circuits · language models · circuit discovery · indirect object identification · multilingual models · causal signals · prompt-specific circuits

The pith

ACC++ extracts causal attention signals from language models in a single forward pass, and many of these signals admit short natural-language descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ACC++, an improved circuit-tracing method based on the attention-causal communication principle. It identifies the components that cause attention decisions along with the low-dimensional signals used to communicate between them, all extracted from one forward pass without replacement models or patching. Across models, a substantial portion of these signals admit short natural-language descriptions, making them interpretable. When applied to indirect object identification, the method shows prompt-specific circuits forming well-defined clusters where heads use distinct signals for different identification mechanisms. In multilingual settings, components are reused across languages but signals are often language-specific, with circuit distances matching linguistic relatedness.

Core claim

ACC++ extracts circuits consisting of components causal for the model's attention decisions together with the low-dimensional signals used to communicate between them. These signals are the contents of subspaces that cause attention on a token pair. The method works from a single forward pass. A substantial portion of the signals are interpretable, admitting short natural-language descriptions. Applied to indirect object identification, it characterizes sensitivity to prompt structure through prompt-specific circuit clusters and shows that in multilingual cases components are shared while signals are language-specific with distances consistent with linguistic relatedness.

What carries the argument

The attention-causal communication principle, which locates low-dimensional subspaces whose contents cause attention on specific token pairs.
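To make the subspace idea concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes a single attention head whose pre-softmax score is the bilinear form x_dst^T W_Q W_K^T x_src, ignoring scaling, biases, and layer norm. The matrices are random toy values, and every name and size is an illustrative assumption. The singular value decomposition splits the score into one term per singular direction, which is the kind of low-dimensional contribution the principle singles out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4                     # toy sizes, not taken from the paper

W_Q = rng.normal(size=(d_model, d_head))    # hypothetical query projection
W_K = rng.normal(size=(d_model, d_head))    # hypothetical key projection
x_dst = rng.normal(size=d_model)            # residual vector at the destination token
x_src = rng.normal(size=d_model)            # residual vector at the source token

# Bilinear form behind the pre-softmax score: x_dst^T (W_Q W_K^T) x_src
omega = W_Q @ W_K.T
U, S, Vt = np.linalg.svd(omega)

# omega has rank at most d_head, so only the first d_head directions matter.
rank = d_head
per_direction = S[:rank] * (U[:, :rank].T @ x_dst) * (Vt[:rank] @ x_src)

score = x_dst @ omega @ x_src
print("score:", score)
print("per-direction terms:", per_direction)
```

The per-direction terms sum to the full score, so a term that dominates the sum is a natural candidate for a rank-1 signal in this simplified setting.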

If this is right

  • Prompt-specific IOI circuits form clusters with heads receiving systematically different signals for distinct identification mechanisms.
  • In multilingual IOI, model components are reused across languages while signals remain language-specific.
  • Cross-language circuit distances are consistent with linguistic relatedness.
  • ACC++ reveals how circuits adapt to prompt structure and enables characterization of model behavior across varied inputs.
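One way to picture "shared components, language-specific signals" is to compare circuits as sets. The sketch below uses the Jaccard distance, which is an assumption made here for illustration; the source does not state which distance the paper uses. The two circuits and their head names are hypothetical. Each element pairs an edge with a singular-vector index, following the figure captions' description of circuits as sets of edge–singular-vector pairs.

```python
def jaccard_distance(circuit_a, circuit_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two circuits given as sets."""
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    if not union:
        return 0.0          # two empty circuits are treated as identical
    return 1.0 - len(a & b) / len(union)


def edges_only(circuit):
    """Drop the singular-vector index and keep only the component edges."""
    return {edge for edge, _ in circuit}


# Hypothetical traces: ((upstream, downstream), singular_vector_index)
english = {
    (("head_3.2", "head_9.6"), 0),
    (("mlp_0", "head_3.2"), 1),
    (("head_5.5", "head_9.6"), 2),
}
spanish = {
    (("head_3.2", "head_9.6"), 0),
    (("mlp_0", "head_3.2"), 4),
    (("head_5.5", "head_9.6"), 7),
}

signal_distance = jaccard_distance(english, spanish)
component_distance = jaccard_distance(edges_only(english), edges_only(spanish))
print(signal_distance, component_distance)
```

In this toy case the edge sets coincide while most signals differ, so the component-level distance is zero and the signal-level distance is large.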

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could map circuits for other tasks to show how models handle input variations more generally.
  • If signals are causal, targeted changes to specific subspaces might steer attention without full model retraining.
  • Clustering of circuits by prompt suggests models maintain modular strategies that could be selectively activated.
  • Signal analysis might extend to phenomena like negation or coreference to uncover additional mechanisms.

Load-bearing premise

The low-dimensional subspaces identified are the actual causal signals driving attention decisions, and the natural-language descriptions assigned to them reflect genuine model mechanisms rather than post-hoc patterns.

What would settle it

An intervention that perturbs or removes the identified signals in the subspaces but leaves the model's attention patterns unchanged would show the signals are not causal.
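The test described above can be sketched on a toy head. This is an illustrative setup with random weights, not the paper's experiment, and all names and sizes are assumptions. It removes the destination residual's component along one left singular direction of W_Q W_K^T and compares the attention row before and after.

```python
import numpy as np


def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(1)
d_model, d_head, n_src = 16, 4, 6   # toy sizes (assumptions)
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
omega = W_Q @ W_K.T
U, S, Vt = np.linalg.svd(omega)

x_dst = rng.normal(size=d_model)            # destination residual
X_src = rng.normal(size=(n_src, d_model))   # one row per source token


def scores(x):
    # score_j = x^T omega X_src[j] for every source position j
    return X_src @ omega.T @ x


k = 0                                        # singular direction to ablate
x_ablated = x_dst - (U[:, k] @ x_dst) * U[:, k]

scores_before = scores(x_dst)
scores_after = scores(x_ablated)
attn_before = softmax(scores_before / np.sqrt(d_head))
attn_after = softmax(scores_after / np.sqrt(d_head))

# The ablation removes exactly the rank-1 term carried by direction k.
rank1_term = S[k] * (U[:, k] @ x_dst) * (X_src @ Vt[k])
print(attn_before.round(3))
print(attn_after.round(3))
```

If removing a claimed signal left the attention row essentially unchanged on the token pair in question, that would count against the signal being causal for that attention decision.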

Figures

Figures reproduced from arXiv: 2602.13483 by Azalea Rohr, Gabriel Franco, Lucas M. Tassis, Mark Crovella.

Figure 1: Average linkage clustering of prompt-level traces exposes distinct circuit families rather than a single universal IOI circuit. The top annotation bar indicates high-level templates (ABBA vs. BABA), while the left bar indicates low-level templates (see Appendix F.1 for color code). Circuits are represented as sets of edge–singular-vector pairs.
Figure 2: Signal similarity uncovers common and distinct functionality across prompts. Comparison of signal similarity matrices between representative circuits across three different models, organized by column: GPT-2 Small (left), Pythia-160M (middle), and Gemma-2 2B (right). Representatives for GPT-2 Small are from ABBA (x-axis) and BABA (y-axis). Representatives for Pythia-160M are from Template 10 (x-axis) an…
Figure 3: Traces are interpretable and expose algorithmic differences in circuits between ABBA and BABA in GPT-2. AH node labels are (destination token, source token); edge labels are automatically generated. Red: feeds logits; purple: low-level; orange: provides inhibition signal. Dark green nodes show that, in BABA only, the circuit relies on identifying the “second item in a parallel pair”, i.e., “Kelly”, as the app…
Figure 4: Condition numbers of W_Q (left) and W_K^⊤ (right) from GPT-2 Small.
Figure 5: Condition numbers of W_Q (left) and W_K^⊤ (right) from Pythia-160M.
Figure 6: Condition numbers of W_Q (left) and W_K^⊤ (right) from Gemma-2 2B.
Figure 7: Recipe for the ACC++ solver, explicitly distinguishing destination candidates (P_U^k) and source candidates (P_V^k), with source-view candidates applied across all j ≤ d.
Figure 8: ECDF of the attention weight times the context size for GPT-2 Small in IOI (a), GP (b), and GT (c). The vertical line marks τ = 2.5.
Figure 9: ECDF of the attention weight times the context size for Pythia-160M in IOI (a), GP (b), and GT (c). The vertical line marks τ = 2.5.
Figure 10: ECDF of the attention weight times the context size for Gemma-2 2B in IOI (a) and GP (b). The vertical line marks τ = 2.5.
Figure 11: Across all tasks and models, most ACC++ signals are rank-1 (use one singular-vector direction).
Figure 12: ACC++ selects lower-dimensional signals than ACC across models and tasks, shifting mass toward rank-1 signals. From left to right: IOI, GT, GP. Despite this compression, ACC++ preserves downstream circuit results under the setup of Franco & Crovella (2025). Circuit quality remains comparable to the original ACC solver across tasks and models (…).
Figure 13: ACC++ traces use fewer nodes than ACC across tasks (IOI, GP, GT), indicating more compact circuits under the same tracing setup. Error bars show standard deviation.
Figure 14: ACC++ traces use fewer edges than ACC across tasks (IOI, GP, GT), reflecting a smaller set of causal communications needed to explain the same behaviors. Error bars show standard deviation.
Figure 15: ACC++ requires fewer incoming causal signals per node than ACC on IOI (pooled in-degrees across traced circuits), indicating that ACC++ typically establishes attention causality with fewer signals.
Figure 16: ACC++ requires fewer incoming causal signals per node than ACC (RA) on GP (pooled in-degrees across traced circuits), indicating that ACC++ typically establishes attention causality with fewer signals.
Figure 17: ACC++ requires fewer incoming causal signals per node than ACC (RA) on GT (pooled in-degrees across traced circuits), indicating that ACC++ typically establishes attention causality with fewer signals.
Figure 18: Circuit performance under the experimental setup of Franco & Crovella (2025), comparing ACC++ to the original ACC solver (Relative Attention; RA). ACC++ achieves comparable circuit quality.
Figure 19: Causal effect of ACC++ signal interventions on Indirect Object Identification (IOI) performance across models. Green/red bars show ACC++ signal ablation/boosting, while blue/orange bars show random signal ablation/boosting controls.
Figure 20: Residual-stream perturbation induced by ACC++ interventions. We report cosine similarity and norm ratio between residual vectors before vs. after intervention. Despite minimal residual change (consistent with low-rank signals), interventions can have substantial causal effects on performance (cf. …).
Figure 21: Illustration of the three component definitions using a simplified subgraph.
Figure 22: Average linkage clustering of prompt-level traces exposes distinct circuit families rather than a single universal IOI circuit. The top annotation bar indicates high-level templates (ABBA vs. BABA), while the left bar indicates low-level templates (see Appendix F.1 for color code). Prompts are represented with heads as components.
Figure 23: Average linkage clustering of prompt-level traces exposes distinct circuit families rather than a single universal IOI circuit. The top annotation bar indicates high-level templates (ABBA vs. BABA), while the left bar indicates low-level templates (see Appendix F.1 for color code). Prompts are represented with edges as components. F.4. Representative. Assuming a set of objects C = {c_1, …, c_K}, and some …
Figure 24: Pythia-160M Template 13 (y-axis) vs Template 5 (x-axis).
Figure 25: Pythia-160M Template 13 (y-axis) vs Template 10.
Figure 26: Pythia-160M Template 8 (y-axis) vs Template 6 (x-axis).
Figure 27: Pythia-160M Template 8 (y-axis) vs Template 7 (x-axis).
Figure 28: Pythia-160M Template 14 (y-axis) vs Template 15 (x-axis).
Figure 29: Clustermap of circuits for Gemma-2 2B highlighting two clusters.
Figure 30: Gemma-2 2B Template 5 (y-axis) vs Template 13 (x-axis).
Figure 31: Gemma-2 2B Cluster B (y-axis) vs Cluster A (x-axis).
Figure 32: Layerwise automated-interpretability metrics under the fuzzing protocol for three models. Solid lines show the median per layer for accuracy (blue), precision (orange), and recall (green); shaded bands show the interquartile range (25%–75%) across signals within each layer. Dashed horizontal lines denote the corresponding median metric computed across all layers (same color as the metric).
Figure 33: Panel (a) shows a portion of the GPT-2 circuit for the ABBA representative. Proper noun features are shown in dark green. The figure shows that the “Justin” token is annotated with a “proper noun” feature by 5 MLPs (MLPs 0, 1, 2, 7, and 8) and by attention head (0, 8). It also shows that the “to” token is annotated with the same feature, indicating that it can match with a proper noun. This matching of proper-nou…
Figure 34: Traces expose key differences in Pythia circuits for Templates 9 and 10: both include a shared repeated-name circuit instantiated by head (4, 5), while Template 10 additionally instantiates head (4, 5) on the repeated “to” token and routes this second instance through head (6, 5) before reaching later-layer heads and logits. Gemma: Examples of non-IOI tasks. Finally, we include examples of tasks that are no…
Figure 35: “Fact: The capital of the state containing Dallas is” (Austin). Finally, we turn to an example illustrating the use of ACC++ to elucidate a circuit that performs in-context learning (ICL). The prompt is “What sport Jordan played? A: Basketball. What sport Tom Brady played? A:” and the model correctly answers “football.”
Figure 36: “What sport Jordan played? A: Basketball. What sport Tom Brady played? A:” (football).
Original abstract

Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. A crucial part of finding circuits is understanding why each attention head attends where it does. To this end, we introduce ACC++, an improved circuit-tracing method based on the principle of attention-causal communication (ACC) [1], which identifies signals, i.e., contents of low dimensional subspaces that cause attention on a token pair. ACC++ extracts circuits from a single forward pass, without replacement models or patching. Circuits identified by ACC++ consist of components that are causal for the model's attention decisions, together with the low-dimensional signals used to communicate between them. Here, we first detail the conceptual advances that ACC++ makes over previous work. We then show that across multiple models, a substantial portion of ACC++ signals are interpretable: many signals admit a short natural-language description. We next present a number of new insights into model behavior obtained via ACC++. First, we use ACC++'s interpretable circuits to characterize the sensitivity of indirect object identification (IOI) circuits to prompt structure. We find that prompt-specific circuits form well-defined clusters, and across clusters, heads receive systematically different signals corresponding to distinct mechanisms for identifying the IO name. Next, in multilingual IOI, ACC++ circuits show that while model components are reused across languages, signals are often language-specific. In a four-language IOI case study, cross-language circuit distances are consistent with linguistic relatedness. Together, these results show that ACC++ can shed light on a broad spectrum of model behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ACC++, an improved circuit-tracing method based on the attention-causal communication (ACC) principle, which identifies low-dimensional signals causing attention on token pairs. It claims to extract circuits—consisting of causal components and the signals communicating between them—from a single forward pass without replacement models or patching. Across models, a substantial portion of these signals are reported as interpretable via short natural-language descriptions. The authors apply ACC++ to indirect object identification (IOI), finding that prompt-specific circuits form well-defined clusters with heads receiving systematically different signals for distinct mechanisms, and to multilingual IOI, showing component reuse but language-specific signals whose cross-language distances align with linguistic relatedness.

Significance. If the causal status of the identified subspaces is established, ACC++ would offer a computationally efficient alternative to patching-based methods for discovering interpretable circuits and prompt-specific behaviors, enabling broader analysis of model mechanisms such as sensitivity to prompt structure and cross-lingual processing.

major comments (3)
  1. [Abstract] Abstract: the claim that ACC++ signals 'cause attention on a token pair' and that circuits consist of 'components that are causal for the model's attention decisions' rests entirely on the untested ACC principle without any interventions, ablations, or counterfactual tests; no patching or replacement-model experiments are described to verify sufficiency or necessity of the subspaces.
  2. [Results on interpretability] Interpretability results: the assertion that 'a substantial portion of ACC++ signals are interpretable' provides no quantitative metrics (e.g., fraction of signals with descriptions, inter-rater reliability, or automated validation), error analysis, or details on how natural-language descriptions were assigned and verified, leaving the strength of the interpretability claim unassessable.
  3. [IOI case study] IOI analysis: the claims that prompt-specific circuits 'form well-defined clusters' and that heads 'receive systematically different signals corresponding to distinct mechanisms' lack specification of the clustering algorithm, distance metric, statistical significance tests, or controls showing that signal differences drive the observed behavioral distinctions rather than correlate with them.
minor comments (1)
  1. [Introduction] The description of conceptual advances over prior ACC work would benefit from an explicit comparison table or enumerated list of differences in assumptions and outputs.
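Major comment 2 asks for inter-rater reliability on the natural-language descriptions. For reference, Cohen's kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below is a plain-Python illustration; the binary labels ("does the description fit this signal?") are made-up values, not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0          # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations for ten signals (1 = description fits, 0 = does not)
annotator_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.52, and kappa is 0.28 / 0.48, roughly 0.58.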

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, proposing specific revisions to strengthen the paper while preserving its core contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that ACC++ signals 'cause attention on a token pair' and that circuits consist of 'components that are causal for the model's attention decisions' rests entirely on the untested ACC principle without any interventions, ablations, or counterfactual tests; no patching or replacement-model experiments are described to verify sufficiency or necessity of the subspaces.

    Authors: The ACC principle was introduced and empirically validated via interventions in our prior work [1], which established the causal role of the identified low-dimensional subspaces for attention decisions. ACC++ extends this by providing an efficient extraction method without requiring new replacement models. To directly address the concern, we will revise the abstract and introduction to explicitly reference the validation experiments from [1], add a dedicated subsection summarizing those results, and include new ablation studies (e.g., subspace perturbation tests) on the models studied here to reconfirm sufficiency and necessity. revision: yes

  2. Referee: [Results on interpretability] Interpretability results: the assertion that 'a substantial portion of ACC++ signals are interpretable' provides no quantitative metrics (e.g., fraction of signals with descriptions, inter-rater reliability, or automated validation), error analysis, or details on how natural-language descriptions were assigned and verified, leaving the strength of the interpretability claim unassessable.

    Authors: We agree that quantitative support is needed to make the interpretability claim fully assessable. The current manuscript relies on qualitative examples; in revision we will add: (i) the exact fraction of signals assigned consistent natural-language descriptions across models, (ii) inter-rater reliability metrics (Cohen's kappa) from multiple annotators, (iii) a detailed description of the annotation protocol, and (iv) an error analysis categorizing uninterpretable signals. These additions will be placed in a new subsection under Results. revision: yes

  3. Referee: [IOI case study] IOI analysis: the claims that prompt-specific circuits 'form well-defined clusters' and that heads 'receive systematically different signals corresponding to distinct mechanisms' lack specification of the clustering algorithm, distance metric, statistical significance tests, or controls showing that signal differences drive the observed behavioral distinctions rather than correlate with them.

    Authors: We will clarify the clustering procedure in the revised Methods and Results sections. As the captions of Figures 1, 22, and 23 state, prompt-level traces are grouped by average-linkage clustering, with circuits represented as sets of edge–singular-vector pairs, heads, or edges. We will state the distance metric explicitly, report silhouette scores, perform permutation tests for cluster significance, and add controls (e.g., comparison against shuffled signal labels) to test whether the observed behavioral distinctions are driven by signal differences rather than merely correlated with them. Updated figures and quantitative tables will accompany these additions. revision: yes
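The silhouette score mentioned in the last response can be computed directly from a pairwise distance matrix between circuits. The following plain-Python sketch shows the standard definition on a made-up 4×4 distance matrix. The matrix, the labels, and the shuffled-label comparison are illustrative assumptions rather than the paper's data or protocol.

```python
def silhouette_score(dist, labels):
    """Mean silhouette over all items, given a symmetric distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue        # singleton cluster: silhouette is taken to be 0
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Hypothetical circuit distances: two tight groups that are far apart
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
true_score = silhouette_score(dist, [0, 0, 1, 1])
shuffled_score = silhouette_score(dist, [0, 1, 0, 1])
print(round(true_score, 3), round(shuffled_score, 3))
```

A permutation test would repeat the shuffled-label step many times and compare the resulting distribution against the score for the actual cluster labels.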

Circularity Check

0 steps flagged

Minor self-citation to prior ACC principle; new claims are independent empirical observations

full rationale

The paper introduces ACC++ as building on the attention-causal communication (ACC) principle from reference [1] and applies it to extract circuits and signals from single forward passes. The central results on signal interpretability, prompt-specific circuit clusters in IOI, and cross-language signal differences are presented as direct observations from this application, without any equations, derivations, or reductions that make the outputs equivalent to inputs by construction. No fitted parameters are renamed as predictions, no self-definitional loops appear, and no uniqueness theorems or ansatzes are smuggled via self-citation in a load-bearing way for the new claims. The method is self-contained as an observational extension, warranting only a minor score for the reference to prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method inherits the attention-causal communication principle from prior work without detailing new assumptions here.

pith-pipeline@v0.9.0 · 5588 in / 1211 out tokens · 19689 ms · 2026-05-15T22:03:52.950991+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Data-driven Circuit Discovery for Interpretability of Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 1 Pith paper

  1. [1]

    the input residual embedding at token d at the input to layer ℓ

    URL https://arxiv.org/abs/2308.09124. Hewitt, J., Geirhos, R., and Kim, B. Position: We can’t understand AI using our existing vocabulary. In Forty-second International Conference on Machine Learning Position Paper Track, 2025. URL https://openreview.net/forum?id=asQJx56NqB. Hub...

  2. [2]

    Set up the counterfactual. Choose an attention head and a destination–source pair, then decide whether we are searching for destination signals (in the destination token) or source signals (distributed across all source tokens that compete under Softmax)

  3. [3]

    Enumerate candidate signals. Decompose the residual stream into outputs of upstream components, and project each component’s contribution onto the head’s singular-vector directions to form a set of candidate signals

  4. [4]

    Build a contribution table. Convert candidates into a contribution matrix whose rows correspond to candidate signals and whose columns correspond to source positions in the destination row of attention scores

  5. [5]

    Score candidates with attribution. Use Integrated Gradients to assign each candidate a fixed importance score for the attention weight on the chosen source token, while accounting for Softmax competition across all sources

  6. [6]

    dimensionality

    Solve the counterfactual by greedy removal. Starting from the full set of candidates, iteratively remove the highest-scoring candidates and recompute the attention weight until it drops below a chosen threshold; the removed candidates form the ACC++ explanation set. Destination intervention (find destination signals). Let M_dst denote a set of destination-s...

  7. [7]

    Interpretation stage: we prompt an LLM with the top-40 examples for a signal to generate a single short interpretation

  8. [8]

    the [[cat]] sat on the << mat>>

    Evaluation stage: to score an interpretation, we take the top-20 examples (highest-scoring among the top-40) and mix them with 20 random control examples sampled independently for that same signal, then ask an LLM judge to decide whether the interpretation explains each example (Paulo et al., 2025). Interpretation stage. Given the top-40 activating examples f...

  9. [9]

    Return the results as a Python Dictionary

  10. [10]

    The keys must be the example numbers (1 to 10), and the values must be the binary label (0 or 1)

  11. [11]

    Do not assume the order; explicitly check the number I assigned to each example

  12. [12]

    Ignore any numbers or formatting artifacts INSIDE the text strings (e.g., if a text contains "4).", ignore it)

  13. [13]

    10: 0 } Here are the examples: <user_prompt> Feature interpretation: Words related to American football positions, specifically the tight end position

    Output format: { 1: 0, 2: 1, ... 10: 0 } Here are the examples: <user_prompt> Feature interpretation: Words related to American football positions, specifically the tight end position. Text examples:

  14. [14]

    Getty Images [[ Patriots]]<< tight>> end Rob Gronkowski had his boss

  15. [15]

    posted You should know this[[ about]] offensive line coaches: they are large, demanding<< men>>

  16. [16]

    Media Day 2015 LSU [[ defensive]] end Isaiah Washington (94) speaks to<< the>>

  17. [17]

    running [[ backs]],’’ he said. .. Defensive << end>> Carroll Phillips is improving and his injury is

  18. [18]

    Then, Jack and Kelly went to the garden. Jack gave a basketball to

    [[ line]], with the left side namely << tackle>> Byron Bell at tackle and guard Amini <assistant_response> { 1: 1, 2: 0, 3: 0, 4: 1, 5: 1, } Now evaluate the following examples: Signals for which the interpreter returns no valid interpretation found are treated as missing interpretations and are excluded from scoring-based analyses. Across models, the fra...
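Entries 2 to 6 above outline the ACC++ solver as a counterfactual solved by greedy removal. The sketch below illustrates only that last step on a toy contribution matrix and should not be read as the authors' solver. The source says candidates are scored with Integrated Gradients; here a much simpler proxy score is swapped in (a candidate's contribution to the target source minus its mean contribution to the other sources). The matrix values and the 0.5 threshold are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def greedy_removal(contrib, scores, target_src, threshold):
    """Remove candidates in descending score order until the attention
    weight on target_src falls below threshold.

    contrib[i, j]: contribution of candidate i to the score on source j.
    scores[i]: fixed importance of candidate i (a proxy in this sketch).
    Returns the removed candidate indices and the final attention weight.
    """
    active = np.ones(len(contrib), dtype=bool)
    removed = []
    for i in np.argsort(-scores):
        weight = softmax(contrib[active].sum(axis=0))[target_src]
        if weight < threshold:
            break
        active[i] = False
        removed.append(int(i))
    final_weight = softmax(contrib[active].sum(axis=0))[target_src]
    return removed, float(final_weight)


# Hypothetical table: 4 candidate signals (rows) x 3 source positions (columns)
contrib = np.array([
    [4.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.2],
])
target = 0
# Proxy importance, standing in for Integrated Gradients (an assumption).
proxy_scores = contrib[:, target] - np.delete(contrib, target, axis=1).mean(axis=1)

removed, final_weight = greedy_removal(contrib, proxy_scores, target, threshold=0.5)
print(removed, round(final_weight, 3))
```

In this toy table a single candidate carries most of the attention on the target source, so removing it alone is enough to push the weight below the threshold, and the explanation set has one element.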