Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

2026 · cs.LG · arXiv 2605.24059

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present a three-step recipe for identifying attention-head circuits in pretrained transformers. A per-head spectral signal -- the time-integrated participation ratio of each head's attention output -- ranks heads doing sustained content-dependent computation without labels or attribution gradients. A task-pattern screen filters this general indicator into a task-specific candidate circuit, and group ablation against a matched-random control completes the causal claim. We validate across an 8x parameter range (51M to 1B-active / 7B-total), two architecture families (dense, mixture-of-experts), and four pretraining pipelines. The recipe ports: a 2-6 head induction circuit is causally necessary in every model tested, with a 94-100% drop in synthetic-induction top-1 after ablation. The spectral signal is predictive without supervision: on six independent seeds of a 51M-parameter probe model, the same computation identifies the seed-specific circuit on each seed. The fraction of heads doing identifiable specialized computation is conserved at 17-19% across the Pythia family (124M to 410M), while specific induction circuits stay 3-11 heads -- sublinear in total head count. This paper is the methodology anchor of a three-paper program; companion papers extend the recipe to developmental trajectories during pretraining and to composed-task circuits where pattern selectivity decouples from task-causal structure.
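The spectral signal named in the abstract can be illustrated with a minimal sketch. It uses the standard participation ratio of a covariance spectrum, PR = (sum_i lambda_i)^2 / (sum_i lambda_i^2), where the lambda_i are the eigenvalues of the covariance of one head's output vectors. The abstract does not define "time-integrated", so the sketch assumes it means averaging the ratio over a sequence of snapshots (for example token windows or checkpoints); that choice, and the names `participation_ratio`, `time_integrated_pr`, and `rank_heads`, are illustrative assumptions and not the authors' implementation.

```python
import numpy as np


def participation_ratio(x: np.ndarray) -> float:
    """Participation ratio of the covariance spectrum of x (rows = samples).

    Uses trace identities so no eigendecomposition is needed:
    sum(lambda) = trace(C) and sum(lambda^2) = ||C||_F^2.
    """
    x = x - x.mean(axis=0, keepdims=True)       # center each output dimension
    cov = x.T @ x / max(x.shape[0] - 1, 1)      # sample covariance, (d, d)
    trace = float(np.trace(cov))
    frob_sq = float(np.sum(cov * cov))
    if frob_sq == 0.0:                          # constant output: no variance
        return 0.0
    return trace * trace / frob_sq


def time_integrated_pr(snapshots) -> float:
    """ASSUMPTION: 'time-integrated' is taken as the mean PR over snapshots.

    snapshots: iterable of (n_samples, d_head) arrays for a single head.
    """
    return float(np.mean([participation_ratio(s) for s in snapshots]))


def rank_heads(per_head_snapshots) -> list:
    """Return head indices sorted from highest to lowest time-integrated PR.

    per_head_snapshots: list indexed by head, each entry a list of snapshots.
    """
    scores = [time_integrated_pr(s) for s in per_head_snapshots]
    return sorted(range(len(scores)), key=lambda h: -scores[h])
```

A head whose output lies along a single direction scores close to 1, while a head spreading variance evenly over k directions scores close to k. A ranking like this covers only the first of the three steps; the task-pattern screen and the ablation against a matched-random control are not sketched here.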

representative citing papers

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

The same composed tasks are realized by different attention-head patterns in different models when the same selectivity-plus-ablation protocol is applied.

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

In 1B-class models on DCLM, induction-circuit formation precedes BOS-attractor formation by 10-20x tokens with qualitatively different emergence shapes across architectures.
