ADAG: Automatically Describing Attribution Graphs
Pith reviewed 2026-05-10 17:17 UTC · model grok-4.3
The pith
An automated pipeline uses gradient profiles and LLM explanations to describe feature roles in language model attribution graphs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ADAG is an end-to-end pipeline that builds attribution profiles to quantify each feature's causal input and output effects, applies a clustering algorithm to group related features, and runs an LLM explainer-simulator loop to produce scored natural-language accounts of what each cluster does. Tested on established circuit-tracing benchmarks, it reproduces human-interpretable circuits and additionally isolates steerable clusters that implement a harmful-advice jailbreak in Llama 3.1 8B Instruct.
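To make the scoring step concrete: a minimal sketch of an explainer-simulator loop, assuming the standard explain-then-simulate recipe. The function names and the keyword-overlap stand-in are illustrative, not ADAG's actual interface; a real pipeline would replace the heuristic with a simulator-LLM call.

```python
# Minimal sketch of an explainer-simulator scoring loop (illustrative names,
# not ADAG's interface). An explainer LLM proposes a description of a feature
# cluster; a simulator predicts the cluster's activation on held-out prompts
# from that description alone; the score is the correlation between
# simulated and measured activations.
import numpy as np

def simulate_activation(explanation: str, prompt: str) -> float:
    # Stand-in for the simulator LLM: crude keyword overlap, so the
    # sketch runs end to end.
    words = set(explanation.lower().split())
    return float(sum(tok in words for tok in prompt.lower().split()))

def score_explanation(explanation: str, prompts, measured) -> float:
    # Pearson correlation between simulated and measured activations;
    # a faithful explanation lets the simulator track the real feature.
    simulated = np.array([simulate_activation(explanation, p) for p in prompts])
    measured = np.asarray(measured, dtype=float)
    if simulated.std() == 0 or measured.std() == 0:
        return 0.0
    return float(np.corrcoef(simulated, measured)[0, 1])

# Toy usage: one candidate description scored against three prompts.
prompts = ["the capital of France", "a recipe for bread", "capital cities quiz"]
measured = [0.9, 0.1, 0.8]
print(score_explanation("fires on mentions of capital cities", prompts, measured))
```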
What carries the argument
Attribution profiles that quantify a feature's functional role through its input and output gradient effects, together with a clustering algorithm and an LLM explainer-simulator that produces and scores natural-language descriptions of feature groups.
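What such a profile might look like in code, under the assumption that it concatenates input-side and output-side gradient effects; the paper's exact construction may differ, and all tensors here are toy stand-ins.

```python
# Hedged sketch of an attribution profile: pair (a) the gradient of a
# feature's activation with respect to upstream activations with (b) the
# gradient of a target logit with respect to the feature. Toy tensors only.
import torch

torch.manual_seed(0)
upstream = torch.randn(8, requires_grad=True)   # toy upstream activations
W_in, w_out = torch.randn(8), torch.randn(())   # toy feature read/write weights

feature = torch.relu(upstream @ W_in)           # toy feature activation
logit = feature * w_out                         # toy target logit

# Input effect: how upstream activations move this feature.
input_effect = torch.autograd.grad(feature, upstream, retain_graph=True)[0]
# Output effect: how this feature moves the target logit.
output_effect = torch.autograd.grad(logit, feature)[0]

profile = torch.cat([input_effect, output_effect.reshape(1)])
print(profile.shape)  # one 9-dimensional vector characterizing the feature's role
```

Profiles computed this way for many features could then be compared, e.g. by cosine similarity, which is the kind of geometry a clustering step can exploit.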
If this is right
- Known human-analyzed circuits can be recovered automatically without manual inspection of activation data.
- Steerable feature clusters responsible for specific model behaviors such as harmful advice can be located and described at scale.
- Circuit tracing can move from ad-hoc small studies to systematic, repeatable analysis across many tasks and models.
Where Pith is reading between the lines
- If the descriptions prove reliable, the same pipeline could be applied to audit larger models for unintended behaviors without first knowing what to look for.
- The method opens the possibility of comparing circuit structures across different model families to identify shared computational motifs.
- Repeated application might allow tracking how circuits for safety-related behaviors change during fine-tuning or alignment.
Load-bearing premise
That gradient-derived attribution profiles combined with LLM-generated explanations correctly capture the actual causal roles played by feature groups inside the model.
What would settle it
A new set of circuit-tracing tasks where independent human analysts produce circuit descriptions that systematically differ from the automated ADAG outputs in the functional roles assigned to the recovered clusters.
original abstract
In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce attribution profiles which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer-simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ADAG, an end-to-end automated pipeline for describing attribution graphs in language models. It defines attribution profiles via input and output gradients to quantify feature functional roles, proposes a novel clustering algorithm for grouping features, and employs an LLM explainer-simulator to generate and score natural-language explanations of the resulting groups. The system is evaluated on known human-analyzed circuit-tracing tasks, where it claims to recover interpretable circuits, and is applied to identify steerable clusters responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.
Significance. If the claims are substantiated with quantitative validation, ADAG would represent a meaningful advance in scalable mechanistic interpretability by automating what has previously required ad-hoc human analysis of circuits. The combination of gradient-based profiles with LLM-assisted explanation and clustering is a creative direction that could reduce manual effort in identifying causal structures and undesirable behaviors. The absence of metrics and causal tests in the current version, however, limits the assessed significance.
major comments (2)
- [Abstract] The claim that ADAG 'recovers interpretable circuits' on known human-analysed tasks is backed by no quantitative metrics (e.g., alignment scores with human annotations, precision/recall against ground-truth circuits, or comparison to baselines), nor by details on how the faithfulness of the LLM explanations was measured. This is load-bearing for the central claim of successful automation.
- [Abstract] The assertion that the identified clusters are 'responsible for a harmful advice jailbreak' and 'steerable' lacks reported intervention or steering experiments that isolate the clusters' causal effect (e.g., ablation or activation-patching results showing output change relative to controls). Gradient-based attribution profiles measure sensitivity but do not by themselves establish the required causality without such tests.
minor comments (1)
- [Abstract] The term 'attribution profiles' is introduced without even a brief formal characterization or reference to the gradient definitions that appear later; adding a short parenthetical or equation reference would improve immediate clarity. One possible shape for such a definition is sketched below.
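For concreteness, one plausible shape such a definition could take; this is an assumption on the reviewer's part, not the paper's stated formulation.

```latex
% Hypothetical formalization, assuming the profile concatenates input-side and
% output-side gradient effects; the paper's actual definition may differ.
% a_i: activation of feature i; y: target logit; upstream(i): parent features.
\[
  p_i \;=\; \Bigl( \tfrac{\partial a_i}{\partial a_j} \Bigr)_{j \in \mathrm{upstream}(i)}
  \;\oplus\; \tfrac{\partial y}{\partial a_i},
  \qquad \oplus \text{ denoting vector concatenation.}
\]
```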
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address each major comment below and commit to revisions that add the requested quantitative and causal evidence.
point-by-point responses
- Referee: [Abstract] The claim that ADAG 'recovers interpretable circuits' on known human-analysed tasks is backed by no quantitative metrics (e.g., alignment scores with human annotations, precision/recall against ground-truth circuits, or comparison to baselines), nor by details on how the faithfulness of the LLM explanations was measured. This is load-bearing for the central claim of successful automation.
  Authors: We agree that quantitative metrics are necessary to substantiate the claim of successful automation. The current manuscript evaluates circuit recovery through qualitative alignment with previously published human-analyzed circuits (e.g., the indirect-object-identification circuit). No precision/recall figures, alignment scores, baseline comparisons, or explicit faithfulness metrics for the LLM explanations appear in the submitted version. We will add a dedicated quantitative evaluation subsection that reports precision and recall against ground-truth circuits, inter-annotator agreement for explanation faithfulness, and comparisons against standard clustering baselines. An illustrative sketch of such an edge-level metric appears below. Revision: yes.
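The edge-level metric promised here could be as simple as set precision/recall against the published human circuits. A minimal illustrative sketch, treating a circuit as a set of (source, target) feature edges; the edge names are placeholders, not taken from the paper:

```python
# Sketch of the promised evaluation: edge-set precision/recall against a
# ground-truth circuit drawn from published human analyses.
def edge_precision_recall(predicted: set, ground_truth: set):
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Toy example: three recovered edges, two of which match the human circuit.
pred = {("induction", "name_mover"), ("name_mover", "output"), ("noise", "output")}
gold = {("induction", "name_mover"), ("name_mover", "output"),
        ("s_inhibition", "name_mover")}
print(edge_precision_recall(pred, gold))  # (0.666..., 0.666...)
```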
- Referee: [Abstract] The assertion that the identified clusters are 'responsible for a harmful advice jailbreak' and 'steerable' lacks reported intervention or steering experiments that isolate the clusters' causal effect (e.g., ablation or activation-patching results showing output change relative to controls). Gradient-based attribution profiles measure sensitivity but do not by themselves establish the required causality without such tests.
  Authors: We accept that attribution profiles alone demonstrate sensitivity rather than direct causality. The manuscript identifies candidate clusters via the gradient profiles and LLM explainer-simulator and asserts steerability on that basis, but does not report ablation or activation-patching experiments with control conditions. We will incorporate such experiments in the revision, including cluster ablation and targeted activation patching that quantify the change in harmful-advice generation probability relative to matched controls. An illustrative sketch of such an ablation appears below. Revision: yes.
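One concrete form the committed experiment could take, sketched under the assumption of a HuggingFace-style causal LM; `model`, `layer`, `cluster_dirs`, and `target_token_id` are placeholders the experimenter supplies, none of them taken from the paper.

```python
# Hedged sketch of a cluster-ablation test: project the cluster's directions
# out of one layer's activations via a forward hook, then compare the
# target-token logit before and after.
import torch

def ablate_cluster(model, layer, cluster_dirs, inputs, target_token_id):
    """Return (baseline_logit, ablated_logit) for the target token."""
    def hook(module, hook_inputs, output):
        acts = output[0] if isinstance(output, tuple) else output
        # Project each cluster direction out of the layer's activations.
        for d in cluster_dirs:
            d = d / d.norm()
            acts = acts - (acts @ d).unsqueeze(-1) * d
        return (acts,) + tuple(output[1:]) if isinstance(output, tuple) else acts

    with torch.no_grad():
        baseline = model(**inputs).logits[0, -1, target_token_id].item()
        handle = layer.register_forward_hook(hook)
        try:
            ablated = model(**inputs).logits[0, -1, target_token_id].item()
        finally:
            handle.remove()
    return baseline, ablated
```

Comparing the baseline-to-ablated gap against ablations of random direction sets of the same size would supply the matched controls the referee asks for.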
Circularity Check
No circularity in ADAG pipeline derivation
full rationale
The paper describes an empirical end-to-end pipeline consisting of gradient-based attribution profiles, a clustering algorithm, and an LLM explainer-simulator. No equations, derivations, or self-referential definitions are present that would reduce any claimed result to its inputs by construction. Evaluations on known circuit-tracing tasks and the jailbreak example are presented as external recoveries and demonstrations rather than tautological predictions or fitted renamings. Any self-citations are incidental and not load-bearing for the core method, which relies on standard gradient computations and off-the-shelf LLMs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Input and output gradient effects are sufficient to quantify the functional role of a feature for clustering purposes.
invented entities (2)
- attribution profiles: no independent evidence
- LLM explainer-simulator setup: no independent evidence