arxiv: 2604.07615 · v1 · submitted 2026-04-08 · 💻 cs.CL

Recognition: unknown

ADAG: Automatically Describing Attribution Graphs

Aryaman Arora , Zhengxuan Wu , Jacob Steinhardt , Sarah Schwettmann

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:17 UTC · model grok-4.3

classification 💻 cs.CL

keywords circuit tracingattribution graphslanguage model interpretabilityautomated explanationgradient attributionfeature clusteringjailbreak analysis

0 comments

The pith

An automated pipeline uses gradient profiles and LLM explanations to describe feature roles in language model attribution graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ADAG as a fully automated system to replace manual human interpretation in circuit tracing for language models. It defines attribution profiles from input and output gradients to measure how each feature influences computations, groups them with a new clustering method, and employs an LLM setup to generate and validate natural-language descriptions of those groups. The system recovers circuits previously identified by hand on standard tasks and locates clusters that control a specific harmful behavior in Llama 3.1. If the method works, it removes the main bottleneck that has kept circuit analysis small-scale and expert-dependent.

Core claim

ADAG is an end-to-end pipeline that builds attribution profiles to quantify each feature's causal input and output effects, applies a clustering algorithm to group related features, and runs an LLM explainer-simulator loop to produce scored natural-language accounts of what each cluster does; when tested on established circuit-tracing benchmarks it reproduces human-interpretable circuits and additionally isolates steerable clusters that implement a harmful-advice jailbreak in Llama 3.1 8B Instruct.

What carries the argument

Attribution profiles that quantify a feature's functional role through its input and output gradient effects, together with a clustering algorithm and an LLM explainer-simulator that produces and scores natural-language descriptions of feature groups.

If this is right

Known human-analyzed circuits can be recovered automatically without manual inspection of activation data.
Steerable feature clusters responsible for specific model behaviors such as harmful advice can be located and described at scale.
Circuit tracing can move from ad-hoc small studies to systematic, repeatable analysis across many tasks and models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the descriptions prove reliable, the same pipeline could be applied to audit larger models for unintended behaviors without first knowing what to look for.
The method opens the possibility of comparing circuit structures across different model families to identify shared computational motifs.
Repeated application might allow tracking how circuits for safety-related behaviors change during fine-tuning or alignment.

Load-bearing premise

That gradient-derived attribution profiles combined with LLM-generated explanations correctly capture the actual causal roles played by feature groups inside the model.

What would settle it

A new set of circuit-tracing tasks where independent human analysts produce circuit descriptions that systematically differ from the automated ADAG outputs in the functional roles assigned to the recovered clusters.

Figures

Figures reproduced from arXiv: 2604.07615 by Aryaman Arora, Jacob Steinhardt, Sarah Schwettmann, Zhengxuan Wu.

**Figure 2.** Figure 2: Results of MLP non-locality experiments (excluding BOS from all analyses). [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Results of description generation experiments for gold supernodes for the [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Final circuit graph for texas example in the capitals dataset from Llama 3.1 8B Instruct. We show input attribution, output contribution, and neuron descriptions for C44 ‘Dallas Texas’ to the left. Red indicates negative attribution score, and blue the opposite. with a mean, i.e. Sij = 1 |C| ∑ |C| c=0 (Attr(c) ij + Contrib(c)) ij )/2; (c) post-hoc adjustment of the similarity matrix in order to keep affini… view at source ↗

**Figure 5.** Figure 5: Cluster attribution for top 3-base-ASR influencing clusters per our results, com [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Benchmarking results for circuit tracing + attribution profile computation for [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Additional locality results for MLP neurons in Llama 3.1 8B Instruct. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Attribution description score for quantile and topk approaches, when sweeping [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Final circuit graph for texas example in the capitals dataset from Llama 3.1 8B Instruct. D.2.1 austin Cluster Token 1 Token 2 Token 3 Token 4 Token 5 C0: western state capitals _Austin (96.5%) _Texas (1.4%) _The (0.8%) _Oklahoma (0.7%) _Dallas (0.2%) C1: state name bias _Austin (94.1%) _The (2.8%) _Texas (0.5%) _Oklahoma (0.5%) _Dallas (0.4%) C2: capital city first token _Texas (93.4%) _Oklahoma (3.2%) _T… view at source ↗

**Figure 10.** Figure 10: Final circuit graph for texas example in the capitals dataset from Qwen3 32B. D.4.1 austin Cluster Token 1 Token 2 Token 3 Token 4 Token 5 C1: state capital names Austin (100.0%) Texas (0.0%) A (0.0%) Dallas (0.0%) The (0.0%) C2: capital initial letters Austin (99.6%) The (0.5%) A (0.0%) Dallas (0.0%) O (0.0%) C4: not[direct capital answers] Austin (100.0%) A (0.1%) Dallas (0.0%) O (0.0%) Texas (0.0%) C5:… view at source ↗

**Figure 11.** Figure 11: Final circuit graph for 18 + 24 = 42 example in the math dataset from Llama 3.1 8B Instruct [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗

**Figure 12.** Figure 12: ; the x-axis is the first operand and the y-axis is the second operand. This clearly tells us the contexts in which the cluster is active; e.g. C113 (sums near 42) only tends to be active when the sum is ≈ 40 or ≈ 140. Similarly, C8 (correct even operand sums) is active when the sum is an even number. These clusters unsupervisedly find the ‘bags-of-heuristics’ that this model is known to use when solving … view at source ↗

read the original abstract

In language model interpretability research, \textbf{circuit tracing} aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce \textbf{ADAG}, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce \textit{attribution profiles} which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer--simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ADAG automates circuit description with gradient profiles and an LLM explainer loop, but the causal claims rest on untested qualitative outputs.

read the letter

The main thing to know is that this paper presents ADAG as a full pipeline to replace manual inspection when describing circuits: it builds attribution profiles from input and output gradients, clusters features with a new algorithm, and uses an LLM explainer-simulator to generate and score natural-language descriptions. It tests the system on standard human-analyzed circuit tasks and on a harmful-advice jailbreak in Llama 3.1 8B Instruct, claiming to recover interpretable circuits and steerable clusters.

Referee Report

2 major / 1 minor

Summary. The paper introduces ADAG, an end-to-end automated pipeline for describing attribution graphs in language models. It defines attribution profiles via input and output gradients to quantify feature functional roles, proposes a novel clustering algorithm for grouping features, and employs an LLM explainer-simulator to generate and score natural-language explanations of the resulting groups. The system is evaluated on known human-analyzed circuit-tracing tasks, where it claims to recover interpretable circuits, and is applied to identify steerable clusters responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.

Significance. If the claims are substantiated with quantitative validation, ADAG would represent a meaningful advance in scalable mechanistic interpretability by automating what has previously required ad-hoc human analysis of circuits. The combination of gradient-based profiles with LLM-assisted explanation and clustering is a creative direction that could reduce manual effort in identifying causal structures and undesirable behaviors. The absence of metrics and causal tests in the current version, however, limits the assessed significance.

major comments (2)

[Abstract] Abstract: the claim that ADAG 'recovers interpretable circuits' on known human-analysed tasks provides no quantitative metrics (e.g., alignment scores with human annotations, precision/recall against ground-truth circuits, or comparison to baselines), nor details on how faithfulness of the LLM explanations was measured. This is load-bearing for the central claim of successful automation.
[Abstract] Abstract: the assertion that the identified clusters are 'responsible for a harmful advice jailbreak' and 'steerable' lacks reported intervention or steering experiments that isolate the clusters' causal effect (e.g., ablation or activation patching results showing output change relative to controls). Gradient-based attribution profiles measure sensitivity but do not by themselves establish the required causality without such tests.

minor comments (1)

[Abstract] Abstract: the term 'attribution profiles' is introduced without even a brief formal characterization or reference to the gradient definitions that will appear later; adding a short parenthetical or equation reference would improve immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below and commit to revisions that add the requested quantitative and causal evidence.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that ADAG 'recovers interpretable circuits' on known human-analysed tasks provides no quantitative metrics (e.g., alignment scores with human annotations, precision/recall against ground-truth circuits, or comparison to baselines), nor details on how faithfulness of the LLM explanations was measured. This is load-bearing for the central claim of successful automation.

Authors: We agree that quantitative metrics are necessary to substantiate the claim of successful automation. The current manuscript evaluates circuit recovery through qualitative alignment with previously published human-analyzed circuits (e.g., the indirect-object-identification circuit). No precision/recall figures, alignment scores, baseline comparisons, or explicit faithfulness metrics for the LLM explanations appear in the submitted version. We will add a dedicated quantitative evaluation subsection that reports precision and recall against ground-truth circuits, inter-annotator agreement for explanation faithfulness, and comparisons against standard clustering baselines. revision: yes
Referee: [Abstract] Abstract: the assertion that the identified clusters are 'responsible for a harmful advice jailbreak' and 'steerable' lacks reported intervention or steering experiments that isolate the clusters' causal effect (e.g., ablation or activation patching results showing output change relative to controls). Gradient-based attribution profiles measure sensitivity but do not by themselves establish the required causality without such tests.

Authors: We accept that attribution profiles alone demonstrate sensitivity rather than direct causality. The manuscript identifies candidate clusters via the gradient profiles and LLM explainer-simulator and asserts steerability on that basis, but does not report ablation or activation-patching experiments with control conditions. We will incorporate such experiments in the revision, including cluster ablation and targeted activation patching that quantify the change in harmful-advice generation probability relative to matched controls. revision: yes

Circularity Check

0 steps flagged

No circularity in ADAG pipeline derivation

full rationale

The paper describes an empirical end-to-end pipeline consisting of gradient-based attribution profiles, a clustering algorithm, and an LLM explainer-simulator. No equations, derivations, or self-referential definitions are present that would reduce any claimed result to its inputs by construction. Evaluations on known circuit-tracing tasks and the jailbreak example are presented as external recoveries and demonstrations rather than tautological predictions or fitted renamings. Any self-citations are incidental and not load-bearing for the core method, which relies on standard gradient computations and off-the-shelf LLMs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest on two new invented concepts (attribution profiles and the explainer-simulator) plus the domain assumption that gradient effects suffice to group functionally similar features. No free parameters are described in the abstract.

axioms (1)

domain assumption Input and output gradient effects are sufficient to quantify the functional role of a feature for clustering purposes.
Used to define attribution profiles as the basis for the clustering step.

invented entities (2)

attribution profiles no independent evidence
purpose: Quantify functional role of a feature via its input and output gradient effects
New representation introduced to replace manual inspection of activation data.
LLM explainer-simulator setup no independent evidence
purpose: Generate and score natural-language explanations of feature-group roles
Novel automated explanation component that replaces human interpretation.

pith-pipeline@v0.9.0 · 5498 in / 1337 out tokens · 45188 ms · 2026-05-10T17:17:59.216737+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

URLhttps://arxiv.org/abs/2309.08600. 10 Preprint. Under review. Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.),Ad- vances in Neural Information Processing Systems 38: Annual Conf...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5555/1953048.2021071 2024
[2]

arXiv preprint arXiv:2508.21258 (2025) 3

URLhttps://arxiv.org/abs/2508.21258. Harish Kamath, Emmanuel Ameisen, Isaac Kauvar, Rodrigo Luger, Wes Gurnee, Adam Pearce, Sam Zimmerman, Joshua Batson, Thomas Conerly, Chris Olah, and Jack Lindsey. Tracing attention computation through feature interactions.Transformer Circuits Thread,

work page arXiv
[3]

InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23)

URLhttps://transformer-circuits.pub/2025/attention-qk/index.html. 11 Preprint. Under review. Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, and Samuel Marks. Activation oracles: Training and evaluating llms as general-purpose activation explainers.arXiv...

work page doi:10.1145/3600006.3613165 2025
[4]

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D

URLhttps://openreview.net/pdf?id=NpsVSN6o4ul. Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-...

2025
[5]

sequential

URLhttps://openreview.net/forum?id=K2CckZjNy0. 13 Preprint. Under review. Appendix Table of Contents A Circuit tracing backbone 15 B Systems details and benchmarking results 15 C Non-locality in MLP neurons 16 D Additional results oncapitalsdataset 17 D.1 Ablations on threshold selection in attribution descriptions . . . . . . . . . 17 D.2 Detailed result...

2026