Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

Yongzhong Xu

arxiv: 2605.24059 · v2 · pith:2D622J4Snew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-30 15:43 UTC · model grok-4.3

keywords: attention head circuits, transformer interpretability, spectral analysis, circuit discovery, induction heads, group ablation, participation ratio, pretrained models

The pith

A three-step spectral recipe identifies a 2-6 head induction circuit that is causally necessary in every tested transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a label-free method to locate the small groups of attention heads that carry out a given computation inside a pretrained transformer. A spectral measure of each head's attention output first flags heads engaged in sustained content-dependent work. A task-specific screen then narrows the list, and targeted group ablation against random controls tests whether those heads are actually required. When applied to the induction task, the procedure finds a circuit of only 2-6 heads whose removal destroys performance in models ranging from 51 million to 1 billion active parameters.

Core claim

The time-integrated participation ratio of each head's attention output ranks heads by sustained content-dependent computation. A task-pattern screen selects candidates from this ranking, and group ablation against a matched-random control shows that a 2-6 head induction circuit is causally necessary for synthetic-induction top-1 accuracy, producing a 94-100 percent drop in every model tested across an 8x parameter range, two architecture families, and four pretraining pipelines. The same unsupervised signal recovers the correct seed-specific circuit on each of six independent 51M-parameter runs.

What carries the argument

The per-head spectral signal, the time-integrated participation ratio of each head's attention output, which ranks heads performing sustained content-dependent computation without labels or gradients.
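
A minimal sketch of how such a signal could be computed, under stated assumptions: the participation ratio is taken as (tr C)² / tr(C²) of the covariance C of a head's output vectors (the form quoted in the referee report below), and "time-integrated" is read as summing the per-checkpoint integrand max(PR_t − 1, 0) named in the Figure 1 caption. The function names, the population normalisation of the covariance, and the toy inputs are hypothetical; the source does not state how the covariance is estimated or which checkpoints enter the sum.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # rows = token positions, columns = output dims


def participation_ratio(outputs: Matrix) -> float:
    """PR = (tr C)^2 / tr(C^2), where C is the covariance of the rows of `outputs`."""
    n, d = len(outputs), len(outputs[0])
    means = [sum(row[j] for row in outputs) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in outputs]
    # Covariance entries C[j][k]; dividing by n is an assumption of this sketch.
    cov: List[List[float]] = [
        [sum(r[j] * r[k] for r in centered) / n for k in range(d)]
        for j in range(d)
    ]
    trace = sum(cov[j][j] for j in range(d))
    # For a symmetric matrix, tr(C^2) equals the sum of squared entries.
    trace_of_square = sum(c * c for row in cov for c in row)
    return trace * trace / trace_of_square if trace_of_square > 0 else 0.0


def pr_integral(outputs_per_checkpoint: Sequence[Matrix]) -> float:
    """Sum of max(PR_t - 1, 0) over checkpoints t for one head."""
    return sum(
        max(participation_ratio(o) - 1.0, 0.0) for o in outputs_per_checkpoint
    )


# Toy inputs: outputs confined to one direction versus spread over two.
rank_one = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
score = pr_integral([rank_one, isotropic, isotropic])
```

In this toy case the rank-one checkpoint contributes nothing to the sum, while each isotropic checkpoint contributes about one unit. The sketch uses only the standard library to stay self-contained; an array library would be the natural choice at realistic head dimensions.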

If this is right

A circuit of only 2-6 heads is causally necessary for the induction task in every model tested, regardless of size or architecture.
The fraction of heads showing identifiable specialized computation stays fixed at 17-19 percent across the Pythia family while total head count grows.
Induction circuits remain small (3-11 heads) and grow sublinearly with total head count.
The spectral signal identifies the correct circuit on each independent training seed without any task supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-step procedure could be used to track when and how induction circuits form during pretraining.
The recipe may separate pattern-selective heads from those that are actually required for task success in more complex composed behaviors.
Spectral signals of this kind could locate circuits for other common operations such as copying or factual recall.

Load-bearing premise

The time-integrated participation ratio of a head's attention output reliably marks heads that perform sustained content-dependent computation.

What would settle it

Ablating the heads ranked highest by the spectral signal produces no larger drop in synthetic-induction accuracy than ablating an equal number of randomly chosen heads.
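
The falsification test above can be sketched as a comparison between one candidate group and many equal-size random groups. Everything here is illustrative: `Evaluator` stands in for a real routine that returns synthetic-induction top-1 accuracy with a given set of heads ablated, `toy_evaluate` and its numbers are invented, and "matched" is interpreted only as matching group size and excluding the candidate heads, which may differ from the paper's control.

```python
import random
from typing import Callable, FrozenSet, Iterable, List, Tuple

Head = Tuple[int, int]  # (layer, head) index pair
Evaluator = Callable[[FrozenSet[Head]], float]  # hypothetical: accuracy after ablation


def relative_drop(evaluate: Evaluator, group: Iterable[Head]) -> float:
    """Fraction of baseline accuracy lost when `group` is ablated."""
    baseline = evaluate(frozenset())
    return (baseline - evaluate(frozenset(group))) / baseline


def compare_to_random(
    evaluate: Evaluator,
    candidate: Iterable[Head],
    all_heads: Iterable[Head],
    n_trials: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (candidate drop, fraction of random groups with a drop at least as large)."""
    rng = random.Random(seed)
    candidate = frozenset(candidate)
    pool: List[Head] = sorted(h for h in all_heads if h not in candidate)
    candidate_drop = relative_drop(evaluate, candidate)
    control_drops = [
        relative_drop(evaluate, rng.sample(pool, len(candidate)))
        for _ in range(n_trials)
    ]
    as_large = sum(d >= candidate_drop for d in control_drops) / n_trials
    return candidate_drop, as_large


# Toy stand-in for a model: two heads matter a lot, the rest barely at all.
CIRCUIT: FrozenSet[Head] = frozenset({(1, 2), (1, 5)})
ALL_HEADS: List[Head] = [(layer, head) for layer in range(4) for head in range(8)]


def toy_evaluate(ablated: FrozenSet[Head]) -> float:
    accuracy = 0.9
    for head in ablated:
        accuracy *= 0.1 if head in CIRCUIT else 0.99
    return accuracy


drop, fraction_as_large = compare_to_random(toy_evaluate, CIRCUIT, ALL_HEADS)
```

If the fraction of random groups matching the candidate's drop were not close to zero on a real model, the ranking would have failed the test described above.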

Figures

Figures reproduced from arXiv: 2605.24059 by Yongzhong Xu.

**Figure 1.** Capability circuits emerge early in pretraining and the per-head spectral signal precedes their formation. (A) Per-head spectral signal at each checkpoint: max(PR_t − 1, 0), the integrand of the PR-integral ranking statistic defined in §3.2, plotted per-checkpoint rather than as a cumulative sum so the temporal structure of emergence is visible. Three identified heads in Pythia 1B: the induction head L4·H4…

**Figure 2.** The TS-51M six-seed validation, in four panels. (A) Per-head spectral signal (PR effective rank, max(PR_t − 1, 0)) at each pretraining step for all 128 heads of TS-51M seed s42, plotted as a heatmap. The four L0 heads L0·H{3, 6, 14, 15} (labeled in red) stand out from the rest of the model around step 800, coincident with the probe-task emergence event (top panel, probe_in_acc). (B) Mechanistic confirmation…

**Figure 3.** The induction circuit identified on FineWeb is invariant to OWT continuation training. (A) Per-head PR trajectory across both phases (FineWeb steps 0–17600, OWT steps 0–8000). The five FineWeb-endpoint induction-circuit heads (L8·H8, L8·H5, L8·H6, L8·H10, L7·H4) maintain elevated PR across the boundary; no head transitions out of the circuit. (B) Whole-model BOS-class fraction (fraction of heads classified…

**Figure 4.** Per-model ablation-floor curves: at what threshold T does ablating all heads with induction-selectivity ≥ T× hit the ablation floor? Y-axis is fraction of baseline synthetic-induction top-1 retained after ablation; x-axis is T (decreasing leftward, so progressively more heads get included as one moves left). OLMo 1B (orange): the curve is already at ∼5% retention by T=100 (n=2 heads ablated) and stays fla…
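
The threshold sweep described in the Figure 4 caption can be sketched as follows. This is an illustration only: `selectivity` maps each head to a hypothetical induction-selectivity score, `evaluate` is a stand-in for a routine returning top-1 accuracy with the given heads ablated, and the toy scores and accuracies are invented rather than taken from the paper.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Head = Tuple[int, int]  # (layer, head) index pair
Evaluator = Callable[[FrozenSet[Head]], float]  # hypothetical: accuracy after ablation


def ablation_floor_curve(
    selectivity: Dict[Head, float],
    evaluate: Evaluator,
    thresholds: Iterable[float],
) -> List[Tuple[float, int, float]]:
    """For each threshold T (high to low), ablate all heads with score >= T.

    Returns (T, number of heads ablated, fraction of baseline accuracy retained).
    """
    baseline = evaluate(frozenset())
    curve = []
    for t in sorted(thresholds, reverse=True):
        group = frozenset(h for h, s in selectivity.items() if s >= t)
        curve.append((t, len(group), evaluate(group) / baseline))
    return curve


# Toy data: two highly selective heads carry the task in this made-up model.
TOY_SELECTIVITY: Dict[Head, float] = {
    (0, 0): 150.0,
    (0, 1): 120.0,
    (1, 0): 30.0,
    (1, 1): 2.0,
}


def toy_evaluate(ablated: FrozenSet[Head]) -> float:
    floor_hit = {(0, 0), (0, 1)} <= ablated
    return 0.8 * (0.05 if floor_hit else 1.0)


curve = ablation_floor_curve(TOY_SELECTIVITY, toy_evaluate, [1, 10, 100, 200])
```

In the toy model the retained fraction falls to the floor as soon as the two most selective heads are included and stays flat as the threshold is lowered further, which is the curve shape the caption describes for OLMo 1B.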

Original abstract

We present a three-step recipe for identifying attention-head circuits in pretrained transformers. A per-head spectral signal -- the time-integrated participation ratio of each head's attention output -- ranks heads doing sustained content-dependent computation without labels or attribution gradients. A task-pattern screen filters this general indicator into a task-specific candidate circuit, and group ablation against a matched-random control completes the causal claim. We validate across an 8x parameter range (51M to 1B-active / 7B-total), two architecture families (dense, mixture-of-experts), and four pretraining pipelines. The recipe ports: a 2-6 head induction circuit is causally necessary in every model tested, with a 94-100% drop in synthetic-induction top-1 after ablation. The spectral signal is predictive without supervision: on six independent seeds of a 51M-parameter probe model, the same computation identifies the seed-specific circuit on each seed. The fraction of heads doing identifiable specialized computation is conserved at 17-19% across the Pythia family (124M to 410M), while specific induction circuits stay 3-11 heads -- sublinear in total head count. This paper is the methodology anchor of a three-paper program; companion papers extend the recipe to developmental trajectories during pretraining and to composed-task circuits where pattern selectivity decouples from task-causal structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The three-step spectral recipe recovers induction circuits consistently across scales but rests on an unproven link between participation ratio and content-dependent computation.

The paper's core contribution is a label-free ranking of attention heads via time-integrated participation ratio of their outputs, followed by task-pattern screening and matched-random ablation. It reports that this finds a small 2-6 head induction circuit that is causally necessary in every tested model from 51M to 1B parameters, with 94-100% performance drops on ablation, and that the same unsupervised signal picks seed-specific circuits reliably.

What stands out is the scale of the validation: consistent results across dense and MoE architectures, multiple pretraining runs, and an 8x parameter range, plus the observation that specialized head fraction stays roughly constant while circuit size grows sublinearly. The matched control and cross-seed checks are concrete steps that strengthen the causal part.

The weakest part is the opening claim that the participation ratio ranks heads doing sustained content-dependent computation. Participation ratio measures effective dimensionality of the output covariance, but the paper does not demonstrate why elevated values track task-relevant selectivity rather than variance, attention spread, or positional effects. Because only the top-ranked heads proceed to screening and ablation, any mismatch at step one carries through. Recovery of known induction circuits is reassuring but does not isolate whether the spectral step itself is doing the right work.

This is aimed at mechanistic interpretability researchers who need scalable circuit-finding tools that avoid heavy supervision. The cross-model evidence is strong enough that a serious referee should see the full implementation details, statistical reporting, and any additional controls on the spectral signal. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents a three-step recipe for identifying attention-head circuits in pretrained transformers. Step 1 computes a per-head spectral signal (time-integrated participation ratio of each head's attention output) asserted to rank heads performing sustained content-dependent computation without labels or gradients. Step 2 applies a task-pattern screen to obtain task-specific candidates. Step 3 performs group ablation against a matched-random control to establish causality. Validation across 51M–1B parameter models (dense and MoE), multiple seeds, and pretraining pipelines shows that a 2–6 head induction circuit is causally necessary, producing 94–100% drops in synthetic-induction top-1 accuracy; the signal recovers seed-specific circuits without supervision and the fraction of specialized heads remains 17–19% while induction circuits stay sublinear in head count.

Significance. If the spectral signal is shown to be a reliable label-free ranker, the recipe supplies a scalable, gradient-free methodology for circuit discovery that generalizes across scales and architectures. The reported consistency across an 8× parameter range, two architecture families, and independent seeds, together with the sublinear scaling observation, would provide a concrete foundation for the companion papers on developmental trajectories and composed-task circuits.

major comments (2)

[Abstract / spectral signal definition] The central assertion that the time-integrated participation ratio 'ranks heads doing sustained content-dependent computation without labels or attribution gradients' is presented without direct evidence or theoretical argument showing why elevated participation ratio must reflect content selectivity rather than higher output variance, broader attention distributions, or position-dependent effects. Because the task-pattern screen and ablation are applied only to the top-ranked heads, any spurious ranking at this step propagates directly into the causal claim.
[Abstract / ablation results] The reported 94–100% performance drop after ablating the 2–6 head induction circuit is load-bearing for the causal necessity claim, yet the precise ablation protocol (which heads, how many matched-random controls, statistical tests, and variance across seeds) is not detailed enough to evaluate whether the drop exceeds what the matched-random baseline already predicts.

minor comments (2)

[Abstract] The distinction between '1B-active / 7B-total' parameters should be clarified with an explicit statement of active vs. total parameter counts for each model family.
[Methods] The participation-ratio formula (trace² / ||·||_F² of the output covariance) is standard but its time-integration window and covariance estimation details are not stated; adding these would improve reproducibility.
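
For reference, the standard participation ratio that the last comment refers to can be written out as below, with C the covariance of a head's attention output and λ_i its eigenvalues. The second line combines it with the integrand max(PR_t − 1, 0) from the Figure 1 caption; the symbol S_h and the checkpoint set 𝒯 are notation introduced here for illustration, since the integration window is exactly what the comment says is not stated.

```latex
\mathrm{PR}(C) \;=\; \frac{(\operatorname{tr} C)^2}{\operatorname{tr}(C^2)}
\;=\; \frac{(\operatorname{tr} C)^2}{\lVert C \rVert_F^2}
\;=\; \frac{\bigl(\sum_i \lambda_i\bigr)^2}{\sum_i \lambda_i^2},
\qquad 1 \le \mathrm{PR}(C) \le \operatorname{rank}(C)

S_h \;=\; \sum_{t \in \mathcal{T}} \max\!\bigl(\mathrm{PR}_t^{(h)} - 1,\; 0\bigr)
```

The bounds hold for any nonzero positive semidefinite C: PR equals 1 when a single eigenvalue carries all the variance and equals the rank when the nonzero eigenvalues are equal.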

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the manuscript requires greater theoretical grounding and experimental detail. We address each major comment below and will revise the paper accordingly to strengthen the presentation of the spectral signal and ablation protocol.

Referee: [Abstract / spectral signal definition] The central assertion that the time-integrated participation ratio 'ranks heads doing sustained content-dependent computation without labels or attribution gradients' is presented without direct evidence or theoretical argument showing why elevated participation ratio must reflect content selectivity rather than higher output variance, broader attention distributions, or position-dependent effects. Because the task-pattern screen and ablation are applied only to the top-ranked heads, any spurious ranking at this step propagates directly into the causal claim.

Authors: We agree that the manuscript would benefit from an explicit theoretical argument linking the participation ratio (PR) to content-dependent computation. The time-integrated PR measures the effective dimensionality of each head's output over the sequence; low PR indicates outputs confined to a low-dimensional subspace, which in attention heads typically arises from sustained, input-selective transformations rather than diffuse variance or purely positional effects. We will add a concise theoretical paragraph in Section 3 (with supporting references to effective-rank analyses in neural representations) and revise the abstract to include this justification. This directly mitigates the propagation concern by bolstering the initial ranking step before the task screen and ablation. revision: yes
Referee: [Abstract / ablation results] The reported 94–100% performance drop after ablating the 2–6 head induction circuit is load-bearing for the causal necessity claim, yet the precise ablation protocol (which heads, how many matched-random controls, statistical tests, and variance across seeds) is not detailed enough to evaluate whether the drop exceeds what the matched-random baseline already predicts.

Authors: The referee is correct that the current description of the ablation protocol lacks sufficient detail for full evaluation. We will expand the methods section (and add a supplementary table) to specify: the precise heads ablated (top 2–6 after task-pattern screen), the number of matched-random controls (100 trials per model), the statistical tests (paired t-tests comparing circuit ablation to random baseline), and variance across seeds (mean ± std reported for the six 51M seeds and four pretraining pipelines). These additions will demonstrate that the observed 94–100% drops significantly exceed random-control predictions. revision: yes
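
A paired comparison of the kind the rebuttal promises could be sketched as follows. This is an assumption-laden illustration: it pairs each model's circuit-ablation drop with that model's mean random-control drop, which is one plausible reading of "paired t-tests" but is not confirmed by the source, and the numbers in the usage lines are invented placeholders, not results from the paper. Only the t statistic and degrees of freedom are computed; turning them into a p-value would need a t-distribution table or a statistics library.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    circuit_drops: Sequence[float],
    random_drops: Sequence[float],
) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-model drop differences."""
    if len(circuit_drops) != len(random_drops):
        raise ValueError("each circuit drop needs a matching random-control drop")
    if len(circuit_drops) < 2:
        raise ValueError("at least two pairs are required")
    diffs = [c - r for c, r in zip(circuit_drops, random_drops)]
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0:
        raise ValueError("differences are identical; t statistic is undefined")
    t_stat = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Placeholder values for four hypothetical models (NOT data from the paper).
example_circuit = [0.97, 0.94, 1.00, 0.96]
example_random = [0.03, 0.05, 0.02, 0.04]
t_stat, dof = paired_t_statistic(example_circuit, example_random)
```

With only a handful of models the degrees of freedom are small, so reporting the per-model differences alongside the statistic would make the comparison easier to audit.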

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines the per-head spectral signal explicitly as the time-integrated participation ratio of each head's attention output, computed from model activations without reference to task labels, gradients, or the final causal claims. The subsequent task-pattern screen and ablation (with matched-random control) are applied downstream, and validation consists of empirical recovery of known induction circuits plus cross-model consistency checks. No equation reduces the claimed ranking property to the result by construction, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on self-citation. The central recipe therefore does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that ablating a small group of heads produces a clean causal effect without compensatory rerouting by the remaining network, and that the spectral participation ratio isolates sustained content-dependent computation.

axioms (1)

Domain assumption: Group ablation of attention heads isolates causal contributions to task performance without triggering compensatory mechanisms from other heads.
The final step of the recipe treats the performance drop after targeted ablation (versus random control) as direct evidence of necessity.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models
cs.LG · 2026-06 · unverdicted · novelty 7.0

The same composed tasks are realized by different attention-head patterns in different models when the same selectivity-plus-ablation protocol is applied.

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes
cs.LG · 2026-06 · unverdicted · novelty 6.0

Co-activation clustering of attention heads proposes candidate circuits that pass causal closure validation in dense 1B models but fail in a Mixture-of-Experts model, where ablation can improve loss.

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures
cs.LG · 2026-06 · unverdicted · novelty 6.0

In 1B-class models on DCLM, induction-circuit formation precedes BOS-attractor formation by 10-20x tokens with qualitatively different emergence shapes across architectures.

Reference graph

Works this paper leans on

4 extracted references · cited by 3 Pith papers

[1] Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, ...

2023

[2] Successor heads: Recurring, interpretable attention heads in the wild

Rhys Gould, Euan Ong, George Ogden, and Arthur Conmy. Successor heads: Recurring, interpretable attention heads in the wild. In International Conference on Learning Representations, 2024. Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. OLMo: Acceler- a...

2024

[3] OLMoE: Open mixture-of-experts language models. arXiv preprint, 2024

Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. OLMoE: Open mixture-of-experts language models. arXiv preprint, 2024. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep

2024

[4] In-context learning and induction heads

Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, Anthropic, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Kevin Wang, Alexandre Variengien, Arthur Conmy, Bu...

2022
