pith. sign in

arxiv: 2607.01940 · v1 · pith:KNWDYP3Knew · submitted 2026-07-02 · 💻 cs.LG · cs.AI

Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits

Pith reviewed 2026-07-03 17:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords conditional co-ablationself-repairtransformer circuitsmechanistic interpretabilitybackup headsablation scoringIOI circuitstructured pruning
0
0 comments X

The pith

Conditional co-ablation recovers the backup heads that maintain function after primary removal in transformer circuits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard ablation scores treat component importance as an additive property of isolated units, yet transformers often self-repair: after a primary component is removed, a dormant backup activates and masks the primary's measured effect. The paper recasts this as a recovery task and introduces Conditional Co-Ablation, which scores remaining units by the growth in their ablation effect once a primary set has been removed. This conditional score exposes the second-order interactions that single-unit methods discard. On the GPT-2-small IOI circuit the method raises backup-head recovery to 0.91 ROC-AUC and the same label-free procedure works for induction heads across eight models. If the claim holds, attribution, capability knockout, and structured pruning can be made repair-aware without task labels.

Core claim

Component importance is not merely an isolated-unit property; in robust circuits the components that matter become visible only under the interventions that make them necessary. Conditional Co-Ablation computes, for each remaining unit, how much its ablation effect grows after a primary set is removed; the resulting score recovers the heads that causally carry the repair, as confirmed by counterfactual patching.

What carries the argument

Conditional Co-Ablation (CoAx), a label-free output-grounded score that quantifies the increase in each remaining unit's ablation effect once a primary set has been removed.

If this is right

  • Recovered backups correct self-repair-masked attribution scores.
  • The same units identify the minimal set required for capability knockout.
  • Repair-aware structured pruning scales from 124M to 7B parameters.
  • The label-free procedure transfers to induction-head circuits across eight models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Circuit analysis pipelines may need conditional interventions as a routine second pass rather than relying on single ablations.
  • The same growth metric could be applied to identify redundant pathways in non-transformer architectures.
  • Repair-aware pruning informed by these scores may preserve performance on downstream tasks better than magnitude-based methods.

Load-bearing premise

Growth in ablation effect after primary removal specifically identifies the causally responsible backup components rather than other correlated units.

What would settle it

Counterfactual patching experiments in which high-CoAx units fail to restore the original behavior when the primary set is removed.

Figures

Figures reproduced from arXiv: 2607.01940 by Chau Yuen, Wei Yang Bryan Lim, Zhiren Gong, Zihao Zeng.

Figure 1
Figure 1. Figure 1: COAX exposes the self-repair backups first-order scoring misses, and the same score pays off downstream on GPT-2-small. (a) The documented backups sit in the first-order blind spot, lifting backup ROC-AUC 0.33→0.91. (b) First-order leaves the key backups in the blind spot while COAX ranks them at the top, whereas an inactive head stays low under both. (c) Ablating the primaries barely moves the IOI answer … view at source ↗
Figure 2
Figure 2. Figure 2: The second-order structure COAX exploits, and the circuit it is (GPT-2-small). (a) First-order scoring gives one number per head; (b) the pairwise synergy Iuv is a whole matrix, where the name-movers and their backups form a bright off-diagonal block (boxed), the self-repair module a per-head score cannot see. (c,d) The primaries write the answer while the backups stay dormant (c) while ablating the primar… view at source ↗
Figure 3
Figure 3. Figure 3: How the discovered backups take over (GPT-2-small). As primaries are ablated (k), the backups grow in output norm (a) and conditional causal effect (b) while random heads stay flat; the answer’s direct logit attribution hands off to them (c); and freezing their dormant activations removes 55% of the self-repair (d), confirming the wake-up is causal. Completeness. A circuit is complete if it reproduces the … view at source ↗
Figure 4
Figure 4. Figure 4: The backup discovery generalizes along two axes. (a) Scale: the discovered IOI backups wake up under primary ablation across the GPT-2 family (blue, above the rest of the model in grey). (b) Architecture: label-free induction completion on eight models from six families, where COAX drops the induction log-probability far more than matched-random. (c) A +own control: both +COAX and +own sit far above the ra… view at source ↗
Figure 5
Figure 5. Figure 5: Repair-aware pruning across scales (WikiText-2 perplexity vs. heads pruned, log). (a– d) On four models from 124M to 7B, classical and gradient baselines degrade sharply while the self-repair-aware co-ablation order (blue) stays nearest dense at every scale; the knockout payoff of the same score is in Figure 1c. Zero-shot accuracy and the full sweep: Appendix E [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Backup-recovery quality vs. compute cost (forward-pass-equivalents; one backward = two forwards; GPT-2-small, |U|=144, L=12). COAX is the only score that reaches 0.91, is forward￾only and label-free, and – crucially – its conditional formulation recovers the second-order backup signal at O(|U|), far to the left of the O(|U|2 ) wall of the explicit pairwise synergy it replaces. Gradient baselines are cheape… view at source ↗
Figure 7
Figure 7. Figure 7: collects three diagnostics of the score. (a) Data efficiency. Because COAX is calibration￾only, a practical question is how many prompts it needs: it is already strong at 32 unlabeled prompts (0.90 AUC) and saturates by ∼ 64 (even 16 reach 0.86), so label-free discovery is also data-cheap. (b) What is load-bearing. The one design choice that matters is centering the features against the output distribution… view at source ↗
Figure 8
Figure 8. Figure 8: Why COAX reaches 0.91: the same backups, two scores. Each point is one of the 144 GPT-2-small heads (141 non-seed candidates plus the 3 primary seeds, shown for context but excluded from the AUC), placed by its rank percentile under first-order single-ablation saliency (left) and under the conditional co-ablation COAX score (right). The documented backup name-movers (blue stars) rank near the bottom under … view at source ↗
Figure 9
Figure 9. Figure 9: The first-order blind spot recurs across the GPT-2 family. Following the per-head, across-scale style of mechanistic-interpretability surveys (Olsson et al., 2022), each panel plots ev￾ery attention head of a model by its first-order single-ablation saliency (x, rank percentile) against its COAX conditional-growth score (y, rank percentile). The shaded blind spot – low saliency, high conditional growth – i… view at source ↗
Figure 10
Figure 10. Figure 10: Pairwise co-ablation synergy Suv over IOI circuit heads, grouped by role. The name￾movers and their backups form a high-synergy block (blue outline) invisible to single-ablation saliency – the off-diagonal interaction the COAX score reads (Proposition 3). C.2.3 SIGNATURE FILTER: PRECISION The two label-free signatures (output-norm wake-up under primary ablation, conditional causal drop) are not only descr… view at source ↗
Figure 11
Figure 11. Figure 11: The independent structural read, made visible. Final-position attention to each token role, averaged over 96 IOI prompts. The whole name-mover family – primaries and the COAX￾discovered backups – attends to the IO name (blue box); a random head does not, and the S column stays low (they read IO, not S). This ablation-free read, uncorrelated with the COAX score (ρ=0.09), independently lights up the same ba… view at source ↗
Figure 12
Figure 12. Figure 12: Anatomy of one backup head ([10, 6], GPT-2-small). (a) Final-position attention (mean over 96 IOI prompts, BOS sink excluded): the backup reads the IO name (0.29) far more than the S name or other content, just like the primary name-mover [9, 9] (0.61); the first-order false positive [1, 10] reads neither. (b) Direct-logit contribution to the IO−S direction: the backup is near zero on the clean model (dor… view at source ↗
Figure 13
Figure 13. Figure 13: Token-level attention pattern. Attention from the final (prediction) position to every source token, averaged over 40 name fillings of a fixed IOI template, for a primary name-mover, the COAX-recovered backup [10, 6], and the negative name-mover. All three concentrate on the IO name (boxed) – the defining name-mover behavior – not the subject name S (which appears twice) or the filler tokens. The backup c… view at source ↗
Figure 14
Figure 14. Figure 14: The self-repair circuit COAX recovers, as re-wiring (GPT-2-small IOI). Heads are nodes at their layer (y-axis) and function (x); wires are the documented composition edges; the prompt tokens sit at the bottom and the prediction at the top. (a) On the intact model the primary name-movers (9.9, 9.6, 10.0) read the IO name and write the answer (orange path); the backup name-movers are wired in but dormant. (… view at source ↗
Figure 15
Figure 15. Figure 15: The IOI circuit as token×layer information flow, with the COAX backup route. Each column is a prompt token, each row a layer; gray nodes are the residual stream. The repeated-subject signal is carried up and into the END position by duplicate-token, induction, and S-inhibition heads (olive); the primary name-movers (orange) read the IO name “Mary” and write it to the answer logit. COAX adds the backup nam… view at source ↗
Figure 16
Figure 16. Figure 16: The IOI circuit and the self-repair backups COAX recovers. A first-order analysis returns the orange path: duplicate-token / induction / previous-token heads feed the S-inhibition heads, which gate the primary name-movers that write the IO name to the logits. COAX adds the blue path – the backup name-movers, dormant on the intact model, form a parallel route to the IO logit that activates once the primari… view at source ↗
Figure 17
Figure 17. Figure 17: Where the circuit lives, and the first-order/second-order inversion. The 12×12 head grid scored by first-order single-ablation saliency (left) and by the conditional co-ablation COAX score (right); primary name-movers (orange squares) and COAX-recovered backups (blue circles) are marked. Both bands sit in layers 9–11, but they separate by signal: the backups are dark (low) under first-order saliency – the… view at source ↗
Figure 18
Figure 18. Figure 18: Cross-scale, cross-architecture geometry across twelve models. Each cell is the mar￾gin (co-ablation − co-activation VS-ACTIVE ROC-AUC) on a movement circuit: blue where the output-grounded co-ablation lens wins, orange where the input-side co-activation lens wins. The pattern is consistent across scale and architecture and is set by the circuit mechanism, not the model – co-ablation wins the output-movem… view at source ↗
read the original abstract

Mechanistic interpretability often relies on component-level interventions to discover how a model produces a behavior. This guides attribution, capability knockout, and model pruning downstream to operate by scoring each unit by the effect of ablation in isolation. Such first-order scoring is natural when component importance is additive, but becomes misleading when a transformer self-repairs: after a primary component is removed, a dormant backup can take over, muting the primary's measured effect while the backup itself appears irrelevant on the intact model. We recast this failure as a recovery task, conditional circuit completion, and introduce Conditional Co-Ablation (CoAx), a label-free, output-grounded score that asks how much each remaining unit's ablation effect grows once a primary set has been removed. This conditional growth exposes the second-order interaction that single-unit scores discard. On the GPT-2-small IOI circuit, CoAx raises backup-head recovery from 0.33 to 0.91 ROC-AUC, outperforming all baselines, including self-repair-aware gradient scores (best 0.82); counterfactual patching verifies that the recovered heads causally carry the repair. The same label-free procedure transfers to induction across eight models. Beyond discovery, the recovered backups correct self-repair-masked attribution, identify the components required for capability knockout, and yield repair-aware structured pruning scaling from 124M to 7B. Component importance is therefore not merely an isolated-unit property: in robust circuits, the components that matter can become visible only under the interventions that make them necessary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Conditional Co-Ablation (CoAx), a label-free score that recovers backup components in self-repairing transformer circuits by quantifying the growth in a unit's ablation effect after a primary set is removed. On the GPT-2-small IOI circuit, CoAx achieves 0.91 ROC-AUC for backup-head recovery (vs. 0.33 for standard ablation and 0.82 for best gradient baseline); counterfactual patching confirms the recovered heads causally implement the repair. The method transfers to induction-head recovery across eight models and improves downstream tasks including attribution correction, capability knockout, and structured pruning up to 7B scale.

Significance. If the empirical results hold, the work provides a concrete advance for mechanistic interpretability by directly addressing self-repair, a known failure mode of first-order ablation scoring. Credit is due for the causal verification step via counterfactual patching on the IOI circuit and for the label-free transfer demonstration across multiple models and scales; these elements strengthen the central claim beyond correlational recovery metrics.

major comments (2)
  1. [§4.2] §4.2 (CoAx definition and IOI evaluation): the ROC-AUC metric for backup recovery requires an explicit positive set of ground-truth backup heads and exclusion criteria; without these details the reported lift from 0.33 to 0.91 cannot be reproduced or compared to the gradient baseline of 0.82.
  2. [§5] §5 (counterfactual patching verification): while patching is used to confirm causal contribution of CoAx-recovered heads, the manuscript does not report the magnitude of the repair effect restored or the fraction of total self-repair recovered; this is load-bearing for the claim that the recovered heads 'causally carry the repair'.
minor comments (3)
  1. The induction transfer experiment mentions eight models but provides no per-model breakdown or aggregate statistics; a table summarizing ROC-AUC per model would clarify the scope of generalization.
  2. No error bars, number of random seeds, or statistical tests accompany the ROC-AUC numbers or pruning scaling curves; adding these would strengthen the performance claims without altering the central argument.
  3. [§3] Notation for the conditional growth term in the CoAx score (difference of ablation effects) should be defined with an equation number for clarity when contrasting with first-order scores.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. We address each major comment below and will incorporate the requested details into the revised manuscript.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (CoAx definition and IOI evaluation): the ROC-AUC metric for backup recovery requires an explicit positive set of ground-truth backup heads and exclusion criteria; without these details the reported lift from 0.33 to 0.91 cannot be reproduced or compared to the gradient baseline of 0.82.

    Authors: We agree that explicit specification of the positive set and exclusion criteria is required for reproducibility. The current manuscript defines the positive set as the heads previously identified as backups in the IOI circuit literature (specifically the set of heads whose activation increases after primary ablation) and applies exclusion criteria based on attention-head type and layer range; however, these details are only summarized rather than enumerated. In the revision we will add a dedicated paragraph in §4.2 listing the exact positive-set heads, the exclusion rules, and the precise procedure used to compute the ROC-AUC values, enabling direct reproduction of the 0.91 result and comparison against the 0.82 gradient baseline. revision: yes

  2. Referee: [§5] §5 (counterfactual patching verification): while patching is used to confirm causal contribution of CoAx-recovered heads, the manuscript does not report the magnitude of the repair effect restored or the fraction of total self-repair recovered; this is load-bearing for the claim that the recovered heads 'causally carry the repair'.

    Authors: We concur that reporting the magnitude of the restored repair effect and the fraction of total self-repair recovered would strengthen the causal claim. The existing counterfactual patching experiment demonstrates that patching the CoAx-recovered heads restores a statistically significant portion of the original circuit behavior, but does not quantify the absolute logit difference or the percentage of the full self-repair gap closed. In the revised §5 we will add these two metrics (restored logit delta and fraction of total repair recovered) computed on the same IOI task, together with the corresponding numbers for the primary heads alone, to make the load-bearing claim fully supported by the reported evidence. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines CoAx directly as a conditional ablation growth score computed from model interventions on the intact and ablated network; this is an empirical procedure, not a derivation that reduces by the paper's own equations to a fitted parameter or prior result. Central claims rest on measured ROC-AUC gains (0.33 to 0.91) and counterfactual patching verification on the IOI circuit, which are independent of any self-citation chain or self-definitional loop. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear as load-bearing steps. The method is self-contained against external benchmarks via patching.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method is presented as a direct computation from ablation interventions.

pith-pipeline@v0.9.1-grok · 5819 in / 1112 out tokens · 29802 ms · 2026-07-03T17:29:27.655552+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 7 canonical work pages · 5 internal anchors

  1. [1]

    2021 , journal=

    A Mathematical Framework for Transformer Circuits , author =. 2021 , journal=

  2. [2]

    2022 , eprint =

    In-context Learning and Induction Heads , author =. 2022 , eprint =

  3. [3]

    Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    Wang, Kevin and Variengien, Alexandre and Conmy, Arthur and Shlegeris, Buck and Steinhardt, Jacob , year =. Interpretability in the Wild: a Circuit for Indirect Object Identification in. 2211.00593 , archivePrefix =

  4. [4]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Towards Automated Circuit Discovery for Mechanistic Interpretability , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  5. [5]

    2023 , eprint =

    Attribution Patching Outperforms Automated Circuit Discovery , author =. 2023 , eprint =

  6. [6]

    Conference on Language Modeling (COLM) , year =

    Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms , author =. Conference on Language Modeling (COLM) , year =

  7. [7]

    2024 , eprint =

    Kram. 2024 , eprint =

  8. [8]

    2024 , eprint =

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models , author =. 2024 , eprint =

  9. [9]

    2024 , eprint =

    Information Flow Routes: Automatically Interpreting Language Models at Scale , author =. 2024 , eprint =

  10. [10]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Finding Transformer Circuits with Edge Pruning , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  11. [11]

    International Conference on Learning Representations (ICLR) , year =

    Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition , author =. International Conference on Learning Representations (ICLR) , year =

  12. [12]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Neuron Shapley: Discovering the Responsible Neurons , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  13. [13]

    2023 , eprint =

    The Hydra Effect: Emergent Self-repair in Language Model Computations , author =. 2023 , eprint =

  14. [14]

    International Conference on Machine Learning (ICML) , year =

    Explorations of Self-Repair in Language Models , author =. International Conference on Machine Learning (ICML) , year =

  15. [15]

    2023 , eprint =

    Copy Suppression: Comprehensively Understanding an Attention Head , author =. 2023 , eprint =

  16. [16]

    How Does

    Hanna, Michael and Liu, Ollie and Variengien, Alexandre , booktitle =. How Does. 2023 , note =

  17. [17]

    Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation (

    Edin, Joakim and Csord. Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation (. 2025 , eprint =

  18. [18]

    2026 , eprint =

    Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers , author =. 2026 , eprint =

  19. [19]

    and Wang, Jun and Luppi, Andrea I

    Urbina-Rodriguez, Pedro and Fountas, Zafeirios and Rosas, Fernando E. and Wang, Jun and Luppi, Andrea I. and Bou-Ammar, Haitham and Shanahan, Murray and Mediano, Pedro A. M. , year =. A Brain-like Synergistic Core in. 2601.06851 , archivePrefix =

  20. [20]

    2026 , eprint =

    Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel , author =. 2026 , eprint =

  21. [21]

    2504.13151 , archivePrefix =

    Mueller, Aaron and Geiger, Atticus and Wiegreffe, Sarah and others , year =. 2504.13151 , archivePrefix =

  22. [22]

    Conference on Language Modeling (COLM) , year =

    Transformer Circuit Faithfulness Metrics Are Not Robust , author =. Conference on Language Modeling (COLM) , year =

  23. [23]

    International Conference on Machine Learning (ICML) , year =

    Axiomatic Attribution for Deep Networks , author =. International Conference on Machine Learning (ICML) , year =

  24. [24]

    2023 , eprint =

    A Simple and Effective Pruning Approach for Large Language Models , author =. 2023 , eprint =

  25. [25]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    A Fast Post-Training Pruning Framework for Transformers , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  26. [26]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

    Importance Estimation for Neural Network Pruning , author =. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

  27. [27]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Optimal Brain Damage , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  28. [28]

    International Conference on Machine Learning (ICML) , year=

    SubspacePath Pruner: Inference-time Pruning via Probe-based Representation--Parameter Coupling , author=. International Conference on Machine Learning (ICML) , year=

  29. [29]

    Biometrics , volume =

    Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach , author =. Biometrics , volume =

  30. [30]

    Neural Computation , volume =

    Natural Gradient Works Efficiently in Learning , author =. Neural Computation , volume =

  31. [31]

    2020 , howpublished =

    Interpreting. 2020 , howpublished =

  32. [32]

    International Conference on Learning Representations (ICLR) , year =

    Towards Best Practices of Activation Patching in Language Models: Metrics and Methods , author =. International Conference on Learning Representations (ICLR) , year =

  33. [33]

    Distill , year =

    Zoom In: An Introduction to Circuits , author =. Distill , year =

  34. [34]

    Mechanistic Interpretability for

    Bereska, Leonard and Gavves, Efstratios , journal=. Mechanistic Interpretability for. 2024 , note =

  35. [35]

    OpenAI technical report , year =

    Language Models are Unsupervised Multitask Learners , author =. OpenAI technical report , year =

  36. [36]

    International Conference on Machine Learning (ICML) , year =

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , author =. International Conference on Machine Learning (ICML) , year =

  37. [37]

    Black, Sid and Gao, Leo and Wang, Phil and Leahy, Connor and Biderman, Stella , year =

  38. [38]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma 2: Improving Open Language Models at a Practical Size , author =. arXiv:2408.00118 , year =

  39. [39]

    2 OLMo 2 Furious

    2. arXiv:2501.00656 , year =

  40. [40]

    The Llama 3 Herd of Models

    The Llama 3 Herd of Models , author =. arXiv:2407.21783 , year =

  41. [41]

    Qwen2.5 Technical Report

    Qwen2.5 Technical Report , author =. arXiv:2412.15115 , year =

  42. [42]

    International Conference on Learning Representations (ICLR) , year =

    Pointer Sentinel Mixture Models , author =. International Conference on Learning Representations (ICLR) , year =

  43. [43]

    Bisk, Yonatan and Zellers, Rowan and Le Bras, Ronan and Gao, Jianfeng and Choi, Yejin , booktitle =

  44. [44]

    Think You Have Solved Question Answering? Try

    Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , journal=. Think You Have Solved Question Answering? Try

  45. [45]

    Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , booktitle =