Attribution Patching Outperforms Automated Circuit Discovery

Aaquib Syed, Can Rager, Arthur Conmy · 2024 · DOI 10.18653/v1/2024.blackboxnlp-1.25

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2

citation-polarity summary: background 2

representative citing papers

How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

MechaRule localizes agonist neurons in LLMs via contrastive hierarchical ablation to ground rule extraction in circuitry, recalling 96.8% of high-effect neurons and reducing task performance when suppressed.

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Unpack decomposes transformer credit via a unified backward recursion on the φ(S)U template, recovering known IOI circuits with mode labels and showing consistent duplicate-name suppression across Pythia scales from a single forward pass.

Playing the network backward: A Game Theoretic Attribution Framework

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Backward attribution is reframed as integrals over trajectories in a two-player game on the network, unifying gradients and alpha-beta-LRP while enabling new adaptations that outperform prior methods on ViT-B/16 localization metrics.

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

citing papers explorer

How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits · cs.CL · 2026-05-08 · unverdicted · none · ref 1

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation · cs.LG · 2026-05-04 · unverdicted · none · ref 49

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition · cs.LG · 2026-05-22 · unverdicted · none · ref 17

Playing the network backward: A Game Theoretic Attribution Framework · cs.LG · 2026-05-07 · unverdicted · none · ref 4

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal · cs.LG · 2026-04-09 · unverdicted · none · ref 34

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models · cs.CL · 2026-01-20 · unverdicted · none · ref 295
