
arxiv: 2605.23393 · v1 · pith:KCE57FKFnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

Pith reviewed 2026-05-25 05:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mechanistic interpretability · transformer decomposition · token attribution · circuit discovery · indirect object identification · attention composition · duplicate suppression

The pith

A backward recursion exploiting the shared key-value template decomposes every transformer component into end-to-end paths and token attributions from one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that attention and MLP sublayers share a key-value structure that can be inverted through a single backward recursion. This produces interaction strengths between any pair of components, full paths labeled by K/Q/V routing, and per-token credit assignments without gradients or interventions. On the indirect object identification task the recursion recovers the three known composition connections in GPT-2 small together with their routing modes. The same procedure extracts a consistent duplicate-name suppression pattern across every Pythia model from 160M to 6.9B parameters, showing that the decomposition tracks mechanistic structure at scale.

Core claim

Every component functions as a lookup under the shared template φ(S)U; the Unpack recursion therefore decomposes credit through both attention and MLP sublayers in one pass, yielding end-to-end paths with explicit K/Q/V composition labels and per-token attributions that recover the indirect-object circuit and the duplicate-suppression pattern at every tested scale.

What carries the argument

Unpack backward recursion, which inverts the shared key-value template φ(S)U to compute interaction strengths and route labels between any two components.
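
To make the template concrete, the sketch below shows one plausible reading of φ(S)U for a single attention head and a single MLP, using NumPy. It is not taken from the paper's code: the function name `lookup`, the toy dimensions, the random weights, and the `direction` vector are all illustrative assumptions, and MLP biases are omitted for brevity. The point is only that in both cases the output is a weighted sum over "slots" (context tokens for attention, hidden neurons for the MLP), so each slot's contribution can be read off exactly.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def gelu(s):
    # tanh approximation of GELU
    return 0.5 * s * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (s + 0.044715 * s**3)))

def lookup(phi, S, U):
    """Return phi(S) @ U and the per-slot contributions phi(S)_i * U_i."""
    weights = phi(S)                      # one weight per slot
    contributions = weights[:, None] * U  # shape (slots, d_model)
    return contributions.sum(axis=0), contributions

rng = np.random.default_rng(0)
d_model, d_head, n_tokens, d_hidden = 8, 4, 5, 16  # toy sizes (assumed)
X = rng.normal(size=(n_tokens, d_model))           # residual stream, one row per token

# Attention head, query at the last position: slots are context tokens.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))
q = X[-1] @ W_Q
S_attn = (X @ W_K) @ q / np.sqrt(d_head)  # scores S
U_attn = X @ W_V @ W_O                    # values U, already mapped to d_model
attn_out, attn_contrib = lookup(softmax, S_attn, U_attn)

# MLP without biases: slots are hidden neurons.
W_in = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))
S_mlp = X[-1] @ W_in
mlp_out, mlp_contrib = lookup(gelu, S_mlp, W_out)

# Per-token credit toward a hypothetical readout direction.
direction = rng.normal(size=d_model)
token_credit = attn_contrib @ direction
```

Because the contributions sum to the sublayer output, `token_credit.sum()` equals `attn_out @ direction`, which is the kind of exact credit split a backward recursion could propagate further upstream.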

If this is right

  • All three composition connections and their K/Q/V routing modes are recovered on GPT-2 small.
  • Duplicate name mentions receive strong credit at the first occurrence and suppressed credit at the second, a pattern absent in control prompts.
  • The suppression pattern appears at every scale in the Pythia family from 160M to 6.9B parameters.
  • Per-token attribution distinguishes non-trivial duplicate detection from simple copying.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The uniform treatment of attention and MLP suggests the same recursion could be applied to tasks lacking prior circuit labels.
  • If the template generalizes, the method supplies a way to compare mechanistic routes across models without retraining auxiliary probes.
  • Token-level suppression patterns recovered at every scale imply that duplicate detection is a stable computational motif rather than an artifact of small models.

Load-bearing premise

The shared key-value template holds equally for attention and MLP sublayers so that the backward recursion produces accurate paths without extra checks.

What would settle it

Running Unpack on a model where the recovered paths or token attributions systematically disagree with results obtained by activation patching or ablation on the same prompts.
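
One way such a disagreement could be quantified is a rank correlation between attribution scores and measured knockout effects for the same components. The sketch below is a minimal illustration in plain NumPy; the helper names and all numbers are hypothetical, not results from the paper, and ties are broken by position rather than averaged.

```python
import numpy as np

def ranks(x):
    # Ordinal ranks (0 = smallest); ties are broken by position, which is
    # acceptable for a sketch but not for a careful analysis.
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman rank correlation computed as Pearson correlation of ranks."""
    ra, rb = ranks(np.asarray(a)), ranks(np.asarray(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# Hypothetical per-component scores for five components.
attribution = [0.90, 0.05, 0.40, 0.10, 0.70]    # credit from the decomposition
knockout_drop = [1.80, 0.10, 0.90, 0.30, 1.20]  # logit drop when ablated
shuffled_drop = [0.10, 1.80, 0.30, 1.20, 0.90]  # a deliberately mismatched case

agreement = spearman(attribution, knockout_drop)
disagreement = spearman(attribution, shuffled_drop)
```

A correlation near 1 would indicate that the decomposition orders components the same way interventions do; a low or negative value on real prompts would be the systematic disagreement described above.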

Figures

Figures reproduced from arXiv: 2605.23393 by Aske Plaat, Niki van Stein, Po-Kai Chen.

Figure 1: Unified key-value view. Both attention and MLP compute a weighted sum …
Figure 2: Top-3 composition paths from rerooting at S-Inhibition head A8.H6, filtered …
Figure 3: Knockout validation across the Pythia-deduped family, part 1 of 2. Each point is one …
Figure 4: Knockout validation across the Pythia-deduped family, part 2 of 2. Axes, layer coloring, …
Original abstract

Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key-value template $\phi(S)U$. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end-to-end paths with K/Q/V composition labels, and per-token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT-2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode-specific routing of each connection (K, Q, or V). To test token-level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate-detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground-truth circuit labels. Code is available at https://github.com/Fun-Cry/unpacklm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Unpack, a backward recursion that exploits a shared key-value template φ(S)U assumed to hold for both attention and MLP sublayers. From a single forward pass it produces interaction strengths, end-to-end paths labeled with K/Q/V composition modes, and per-token attributions without interventions, gradients, or auxiliary training. On the IOI task it recovers the three composition connections (with routing modes) reported by Wang et al. (2023) in GPT-2 small and, across the Pythia family (160M–6.9B), consistently recovers the suppression pattern for duplicate name mentions.

Significance. If the template assumption holds exactly, the approach supplies a parameter-free, single-pass method for circuit discovery and token attribution that scales across model sizes and reproduces known mechanistic structure without ground-truth labels. The public code release is a clear strength for reproducibility. The result would be significant for mechanistic interpretability if the central structural claim is validated.

major comments (1)
  1. [Method (template definition and recursion)] The claim that every sublayer (attention and MLP) exactly matches the shared template φ(S)U is load-bearing for the backward recursion and all downstream attributions. Attention admits a natural key-value form, but a standard MLP (W1, activation, W2) does not; any mismatch propagates incorrect credit through the recursion. The manuscript states the template applies to both but supplies no reconstruction error, approximation bound, or ablation demonstrating that deviations leave the recovered K/Q/V routes and duplicate-name suppression patterns unchanged on IOI.
minor comments (2)
  1. [Method] Notation for the template φ(S)U and the precise definition of the backward recursion step should be stated once in a single equation block rather than distributed across paragraphs.
  2. [Abstract and §4] The abstract and results section refer to “consistent patterns across scales” without reporting a quantitative consistency metric (e.g., cosine similarity of attribution vectors or fraction of recovered routes); adding this would strengthen the cross-scale claim.
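
A consistency metric of the kind suggested in minor comment 2 could look like the following sketch, which averages cosine similarity over all pairs of models. It assumes the per-token attribution vectors are aligned to the same token positions of the same prompt; the function names and the numbers are hypothetical placeholders, not values from the paper.

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise_consistency(attributions):
    """Mean cosine similarity over all unordered pairs of models."""
    sims = [cosine(attributions[a], attributions[b])
            for a, b in combinations(sorted(attributions), 2)]
    return float(np.mean(sims))

# Hypothetical per-token attributions for one prompt (four token positions).
attributions = {
    "pythia-160m": [0.80, 0.10, 0.05, 0.60],
    "pythia-1.4b": [0.70, 0.20, 0.10, 0.50],
    "pythia-6.9b": [0.90, 0.05, 0.00, 0.70],
}

score = pairwise_consistency(attributions)
```

Reporting such a score per prompt, together with the same score on matched control prompts, would turn "consistent across scales" into a number a reader can check.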

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for identifying the central role of the φ(S)U template. We address the single major comment below and will incorporate the requested validation in the revision.

Point-by-point responses
  1. Referee: [Method (template definition and recursion)] The claim that every sublayer (attention and MLP) exactly matches the shared template φ(S)U is load-bearing for the backward recursion and all downstream attributions. Attention admits a natural key-value form, but a standard MLP (W1, activation, W2) does not; any mismatch propagates incorrect credit through the recursion. The manuscript states the template applies to both but supplies no reconstruction error, approximation bound, or ablation demonstrating that deviations leave the recovered K/Q/V routes and duplicate-name suppression patterns unchanged on IOI.

    Authors: The manuscript presents the φ(S)U form as an exact algebraic rewriting that applies to both attention and MLP sublayers (Section 3). For attention this is the standard key-value decomposition. For MLPs we rewrite the two-layer computation by treating the first weight matrix and nonlinearity as defining the selection function φ over an implicit set S of input-derived features, with the second matrix supplying the values U; this is exact by construction for the recursion. We acknowledge that the current version does not report per-sublayer reconstruction error or an explicit ablation isolating the MLP template. In the revision we will add: (i) ||f(x) − φ(S)U||_2 reconstruction errors for every attention and MLP sublayer on the IOI prompts, and (ii) an ablation that replaces the MLP portion of the recursion with direct per-neuron attribution while keeping the attention recursion unchanged, then checks whether the duplicate-name suppression pattern and the three K/Q/V composition routes remain unchanged. These additions will directly quantify any deviation and its effect on the reported findings.

    Revision: yes
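
The reconstruction-error check in item (i) can be sketched for a single MLP as follows. This is an assumption about how the rewriting might be made exact, not the construction in the paper's Section 3: the output bias is absorbed as one extra slot with weight 1, and all names, sizes, and weights are illustrative.

```python
import numpy as np

def gelu(s):
    # tanh approximation of GELU
    return 0.5 * s * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (s + 0.044715 * s**3)))

def mlp_reference(x, W1, b1, W2, b2):
    """Standard two-layer MLP forward pass, f(x)."""
    return gelu(x @ W1 + b1) @ W2 + b2

def mlp_as_lookup(x, W1, b1, W2, b2, include_bias=True):
    """The same MLP written as phi(S) @ U, one slot per hidden neuron."""
    S = x @ W1 + b1
    weights, U = gelu(S), W2
    if include_bias:
        # Extra constant slot: weight 1, value row b2.
        weights = np.append(weights, 1.0)
        U = np.vstack([W2, b2])
    return weights @ U

def reconstruction_error(x, params, include_bias=True):
    """L2 norm ||f(x) - phi(S) U||_2 for one input vector."""
    diff = mlp_reference(x, *params) - mlp_as_lookup(x, *params, include_bias=include_bias)
    return float(np.linalg.norm(diff))

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32  # toy sizes (assumed)
params = (rng.normal(size=(d_model, d_hidden)), rng.normal(size=d_hidden),
          rng.normal(size=(d_hidden, d_model)), rng.normal(size=d_model))
x = rng.normal(size=d_model)

err_exact = reconstruction_error(x, params)
err_no_bias = reconstruction_error(x, params, include_bias=False)
```

With the bias slot the error is zero up to floating-point rounding; without it the error equals the norm of the output bias, which illustrates how an incomplete template would surface in the reported numbers.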

Circularity Check

0 steps flagged

No significant circularity; method derives from posited template but validates against external circuits

full rationale

The paper posits that attention and MLP sublayers share the φ(S)U key-value template and builds the Unpack backward recursion on this structure to produce attributions and paths. However, the central claims recover specific composition connections (including K/Q/V routing) from the independent Wang et al. (2023) work and demonstrate consistent duplicate-suppression patterns across Pythia scales, with comparisons to control prompts. No fitted parameters, self-citations for the template, or reductions of outputs to inputs by construction appear in the provided text. Results are externally benchmarked rather than self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that the φ(S)U template applies uniformly to both sublayers; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Both attention and MLP sublayers follow the shared key-value template φ(S)U.
    This is the foundational structure exploited by the Unpack recursion as stated in the abstract.



Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. Circuit Tracing: Revealing Computational Graphs in Language Models. Transformer Circuits Thread.
  2. On the Biology of a Large Language Model. Transformer Circuits Thread.
  3. Transformer Feed-Forward Layers Are Key-Value Memories. Proceedings of EMNLP.
  4. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
  5. In-context Learning and Induction Heads. Transformer Circuits Thread.
  6. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. Proceedings of ICLR.
  7. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. Proceedings of ICML.
  8. Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan. Locating and Editing Factual Associations in …
  9. nostalgebraist. interpreting … 2020.
  10. Hanna, Michael and Liu, Ollie and Variengien, Alexandre. How does …
  11. Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics.
  12. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Proceedings of EMNLP.
  13. Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Łukasz and Polosukhin, Illia. Attention is All you Need.
  14. Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel. GPT-NeoX-20B: An Open-Source Autoregressive Language Model.
  15. Sundararajan, Mukund and Taly, Ankur and Yan, Qiqi. Proceedings of the 34th International Conference on Machine Learning - Volume 70. 2017.
  16. Abnar, Samira and Zuidema, Willem. Quantifying Attention Flow in Transformers. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.385
  17. Syed, Aaquib and Rager, Can and Conmy, Arthur. Attribution Patching Outperforms Automated Circuit Discovery. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.25
  18. Attribution Patching: Activation Patching At Industrial Scale. 2023.
  19. Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor. The …
  20. Language Models are Unsupervised Multitask Learners.
  21. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. ICLR.
  22. Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework. 2025.
  23. Information Flow Routes: Automatically Interpreting Language Models at Scale. Proceedings of EMNLP 2024.
  24. Measuring the Mixing of Contextual Information in the Transformer. Proceedings of EMNLP 2022.
  25. Interpreting GPT: The Logit Lens. 2020.
  26. Conmy, Arthur and Mavor-Parker, Augustine and Lynch, Aengus and Heimersheim, Stefan and Garriga-Alonso, Adrià. Towards Automated Circuit Discovery for Mechanistic Interpretability. Advances in Neural Information Processing Systems.
  27. Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. 2024.