arxiv: 2605.22719 · v1 · pith:VBLHXE6Pnew · submitted 2026-05-21 · 💻 cs.LG

Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

Pith reviewed 2026-05-22 07:39 UTC · model grok-4.3

classification 💻 cs.LG
keywords sparse autoencoders · GPT-2 small · indirect object identification · task failure · activation differences · feature auditing · mechanistic analysis

The pith

A sparse autoencoder feature in GPT-2 small activations correlates strongly with failures on indirect object identification for prompts using 'the keys' as the object.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares sparse autoencoder (SAE) feature activations in GPT-2 small between successful and failed trials of the indirect object identification task. On 300 prompts, where the model reaches 79.7 percent accuracy, 146 of 24,576 layer-8 features differ significantly between the two groups after Holm correction. The strongest one activates almost exclusively when the transferred object is 'the keys,' a case where the model fails 93.3 percent of the time, compared to 7.5 percent on the other seven objects. Three controls (ablating the feature, comparing against raw activations, and repeating across random seeds) show that the behavioral pattern is reliable but that the specific feature is a correlate rather than a sufficient cause. The primary advance is an accessible method for surfacing such interpretable correlates of model errors.

Core claim

The paper establishes that sparse autoencoder features can serve as readable indicators of task failure in language models performing indirect object identification. Specifically, one feature shows a large positive effect size on failure trials and is nearly inactive except on prompts where the object is 'the keys,' which have a dramatically higher failure rate. Ablation experiments indicate the feature is a correlate rather than a sufficient cause, while prediction baselines indicate that the sparse representation offers interpretability without superior predictive accuracy over the full residual stream. The audit pipeline itself, which is cheap and model-agnostic, is the stated methodological contribution.

What carries the argument

The central mechanism is the statistical comparison of sparse autoencoder feature activations across failed and successful trials, using metrics such as Cohen's d for effect size and Fisher exact tests for association with specific lexical items.
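As a rough illustration of the effect-size step, the sketch below computes Cohen's d for one feature using the standard pooled-standard-deviation form. The function name `cohens_d` and the toy activation values are hypothetical, and the source does not say which variant of d the paper uses, so treat this as one plausible reading rather than the paper's code.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(fail, success):
    """Cohen's d (failure minus success) with a pooled standard deviation."""
    n1, n2 = len(fail), len(success)
    pooled_var = ((n1 - 1) * variance(fail) + (n2 - 1) * variance(success)) / (n1 + n2 - 2)
    if pooled_var == 0:
        # Features with no spread in either group have no defined effect size.
        return float("nan")
    return (mean(fail) - mean(success)) / sqrt(pooled_var)


# Toy activations of a single SAE feature (made-up numbers, not from the paper).
fail_acts = [2.0, 3.0, 4.0]
success_acts = [0.0, 1.0, 2.0]

d = cohens_d(fail_acts, success_acts)
print(d)  # 2.0: the group means differ by two pooled standard deviations
```

In an audit of this kind the same function would be applied column by column to the prompt-by-feature activation matrix, with the large-effect cutoff |d| > 0.8 applied afterwards.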

If this is right

  • If the audit method generalizes, similar sparse features could be identified for failures in other language model tasks.
  • The finding that certain features link to specific object names suggests models may have localized sensitivities to particular words or concepts that cause systematic errors.
  • Since ablating the feature does not improve performance, the failure mechanism likely involves interactions across multiple features or layers.
  • The equivalence in predictive power between SAE features and raw activations implies that interpretability gains come at little cost to accuracy in failure prediction.
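The ablation mentioned in the list above can be pictured with a small sketch. It assumes a conventional SAE parameterization (activation = ReLU(w_enc · (x − b_dec) + b_enc), with the feature's contribution to the residual stream given by activation × decoder direction) and removes that contribution at every token position. All names and numbers are hypothetical; the source does not specify how the paper implements the zeroing.

```python
def feature_activation(x, w_enc, b_enc, b_dec):
    """ReLU(w_enc . (x - b_dec) + b_enc) for a single SAE feature (assumed form)."""
    pre = sum(w * (xi - bi) for w, xi, bi in zip(w_enc, x, b_dec)) + b_enc
    return max(0.0, pre)


def ablate_feature(x, w_enc, b_enc, b_dec, d_dec):
    """Subtract one feature's decoder contribution from a residual vector."""
    a = feature_activation(x, w_enc, b_enc, b_dec)
    return [xi - a * di for xi, di in zip(x, d_dec)]


# Toy 3-dimensional "residual stream" at two token positions (made-up values).
positions = [[1.0, 2.0, 0.5], [3.0, -1.0, 0.0]]
w_enc, b_enc = [1.0, 0.0, 0.0], 0.0   # hypothetical encoder row and bias
b_dec = [0.0, 0.0, 0.0]               # hypothetical decoder bias
d_dec = [1.0, 0.0, 0.0]               # hypothetical decoder direction

# Apply the ablation at every token position, as the abstract describes.
ablated = [ablate_feature(x, w_enc, b_enc, b_dec, d_dec) for x in positions]
print(ablated)  # [[0.0, 2.0, 0.5], [0.0, -1.0, 0.0]]
```

In this toy case the encoder and decoder directions coincide, so the feature reads exactly zero after ablation. With a real SAE the two directions generally differ, and re-encoding the ablated vector need not give exactly zero.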

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pipeline could be applied to other models or tasks to discover if lexical triggers for failure are common in language models.
  • Extending the audit to earlier or later layers might reveal where the decision to fail is made in the network.
  • Testing whether retraining or fine-tuning on balanced 'keys' examples reduces the failure rate would check if this is a data artifact.
  • Combining this activation audit with causal interventions in other parts of the model could help isolate the actual cause of the error.

Load-bearing premise

The differences in sparse autoencoder activations observed on this particular collection of prompts reflect meaningful aspects of the model's general processing of the indirect object identification task rather than being tied only to the specific wording or statistics of those prompts.

What would settle it

Applying the same feature audit to a fresh set of prompts with varied objects or to a different model size and observing no features with comparably large effect sizes or selective activation on high-failure subsets would indicate that the correlates are not robust.

Figures

Figures reproduced from arXiv: 2605.22719 by Mahdi Nasermoghadasi.

Figure 1: The audit pipeline. Inputs (blue) are a task corpus, ...
Figure 2: Volcano plot of all 24,576 SAE features at layer 8.
Figure 4: IOI failure rate by transferred-object choice. Seven of ...
Figure 5: Causal ablation. Zeroing feature 17,491 across all token ...
Figure 6: Failure-prediction AUC under four feature representa...
Figure 8: ROC curve for predicting IOI failure from feature ...
Original abstract

We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen's d| > 0.8). The strongest single correlate of failure -- feature 17,491, d=+2.93, Neuronpedia label 'cryptographic keys' -- is essentially silent except when the prompt's transferred object is 'the keys,' on which GPT-2 small fails 93.3% of the time vs. 7.5% on the other seven objects (Fisher exact p = 8.79 x 10^-33). We put this correlate through three controls that a mechanistic claim should pass. (i) A causal ablation: zeroing feature 17,491 in the residual stream across all token positions of the 45 keys prompts does not restore accuracy (6.7% -> 4.4%); the feature is a correlate, not a sufficient cause at this layer. (ii) A representation baseline: a logistic regression on the raw 768-dimensional residual stream reaches 5-fold ROC AUC = 0.929, matching the top-100 SAE features (0.927); the SAE basis adds interpretability, not predictive power. (iii) A seed-robustness check: across five random seeds the keys-subset failure rate stays in 75.0--93.3% (the behavioural effect is real), but feature 17,491 is the top-|d| feature in only 1 of 5 runs. The methodological contribution is therefore the audit pipeline (cheap, model-agnostic, surfaces named correlates) rather than any single feature found through it. We release the code, the 300-prompt corpus, the 300x24,576 activation matrix, the ablation and baseline scripts, and the figures. The full pipeline runs on a laptop (Apple M3 Max, no discrete GPU).
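The headline Fisher exact test can be approximately re-derived from the percentages in the abstract. The sketch below assumes the underlying counts are 42 failures out of 45 'keys' prompts (93.3%) and 19 failures out of the remaining 255 prompts (7.5%); these counts are reconstructed from the reported rates, not taken from the released data, although they do sum to 61 failures out of 300, matching the stated 79.7% accuracy. The helper name `fisher_exact_two_sided` is hypothetical and uses only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def weight(k):
        # Unnormalized hypergeometric weight of the table whose top-left cell is k.
        return comb(row1, k) * comb(row2, col1 - k)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Sum the weights of all tables at least as unlikely as the observed one.
    tail = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return tail / denom


# Reconstructed counts (assumption): rows = keys / other objects, cols = fail / success.
keys_fail, keys_ok = 42, 3
other_fail, other_ok = 19, 236

accuracy = (keys_ok + other_ok) / 300
p_value = fisher_exact_two_sided(keys_fail, keys_ok, other_fail, other_ok)
print(f"accuracy = {accuracy:.3f}, Fisher exact p = {p_value:.3g}")
```

If the reconstructed counts are right, the printed p-value should land in the same range as the reported 8.79 x 10^-33. A mismatch would suggest the true counts differ from this reconstruction.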

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript reports a narrow empirical audit of 24,576 SAE features from the layer-8 residual stream of GPT-2 small on a fixed 300-prompt IOI corpus. GPT-2 small achieves 79.7% accuracy; 146 features meet a Holm-corrected significance threshold and 105 show large effect sizes. Feature 17,491 (Neuronpedia label 'cryptographic keys') exhibits the largest effect (d = +2.93) and is active almost exclusively on the 45 'keys' prompts, where accuracy drops to 6.7% (Fisher exact p = 8.79e-33). Three controls are presented: zero-ablation of the feature does not restore performance, a logistic regression on the raw 768-dimensional residual stream matches the predictive power of the top-100 SAE features (ROC AUC 0.929 vs 0.927), and the behavioral failure rate on 'keys' is stable across five seeds while the identity of the top feature is not. The stated contribution is the open audit pipeline and released artifacts rather than any general mechanistic claim.

Significance. If the reported correlations and controls hold, the work supplies a concrete, low-cost template for auditing task failures with named SAE features together with explicit statistical thresholds, a failed causal intervention, and a matched representation baseline. The full release of the 300-prompt corpus, 300-by-24,576 activation matrix, ablation scripts, and figures is a clear strength that directly supports reproducibility and extension by other researchers.

minor comments (3)
  1. [Abstract] Abstract and §3: the citation 'Bloom (2024)' for the SAE release should be expanded to a full bibliographic entry in the references section.
  2. [Methods] §4.2: the exact construction of the 300-prompt corpus (sampling of the eight objects, template variations) is described only at high level; a short appendix table listing the object set and prompt template would improve reproducibility.
  3. [Results] Table 1 or equivalent: the reported ROC AUC values (0.929 vs 0.927) are numerically close; adding a brief note on whether a paired test was considered would clarify that the SAE basis does not add predictive power beyond the raw residual stream.
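One way to address the paired comparison raised in the last comment is a bootstrap over prompts, sketched below. It computes ROC AUC from the rank (Mann-Whitney) formulation and resamples prompts to get an interval for the AUC difference between two score sets. The function names, the toy labels and scores, and the choice of a bootstrap are all assumptions for illustration; the paper does not report such a test.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=500, seed=0):
    """95% percentile interval for AUC(a) - AUC(b), resampling the same prompts."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined without both classes; draw again
        auc_a = roc_auc(ys, [scores_a[i] for i in idx])
        auc_b = roc_auc(ys, [scores_b[i] for i in idx])
        diffs.append(auc_a - auc_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Toy data: 1 = failed trial. Scores stand in for out-of-fold predictions
# from two representations (made-up numbers, not from the paper).
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores_sae = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.35, 0.25, 0.15, 0.05]
scores_raw = [0.85, 0.7, 0.45, 0.4, 0.2, 0.1, 0.5, 0.25, 0.15, 0.05]

auc_sae = roc_auc(labels, scores_sae)
auc_raw = roc_auc(labels, scores_raw)
ci_low, ci_high = paired_bootstrap_auc_diff(labels, scores_sae, scores_raw)
print(auc_sae, auc_raw, (ci_low, ci_high))
```

An interval that comfortably contains zero would support reading 0.929 and 0.927 as equivalent. Resampling the same prompt indices for both score sets is what makes the comparison paired.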

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for recommending minor revision. We appreciate the recognition of the work's reproducibility strengths, including the full release of the 300-prompt corpus, activation matrix, ablation scripts, and figures. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper conducts a purely empirical audit: it measures SAE activations on a fixed 300-prompt IOI corpus, applies Holm-corrected significance tests and Cohen's d, runs explicit ablation and logistic-regression baselines, and performs a 5-seed robustness check. All reported results are direct statistical comparisons or experimental outcomes on the released activation matrix; no equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The SAE artifact is cited from external work (Bloom 2024) solely as a data source, not as justification for any uniqueness theorem or ansatz. The stated contribution is the reproducible pipeline and artifact release, which rests on the documented experiments rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard statistical assumptions from prior SAE releases and conventional hypothesis testing; no new free parameters, axioms beyond standard math, or invented entities are introduced in the reported audit.

axioms (2)
  • [standard math] Holm correction is appropriate for controlling family-wise error rate across 24,576 simultaneous feature tests.
    Applied to identify the 146 significant features.
  • [domain assumption] Cohen's d threshold of 0.8 validly separates large from smaller effect sizes in activation differences.
    Used to filter the 105 large-effect features.
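For readers unfamiliar with the Holm step-down procedure named in the first axiom, a minimal sketch follows. The significance level alpha = 0.05 is an assumption (the source does not state the level used), and the function name `holm_reject` and the toy p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: return one reject/keep flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (0-based rank) is compared to alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Toy p-values (made-up), one per feature.
p_values = [0.001, 0.04, 0.03, 0.2]
print(holm_reject(p_values))  # [True, False, False, False]

# With 24,576 features, the smallest p-value faces the strictest threshold.
first_threshold = 0.05 / 24576
print(f"{first_threshold:.3g}")  # 2.03e-06
```

The thresholds relax as hypotheses are rejected, which is why Holm is uniformly at least as powerful as a plain Bonferroni correction while still controlling the family-wise error rate.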

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  [1] J. Lindsey, A. Templeton, et al., “On the Biology of a Large Language Model,” Anthropic Technical Report, 2025
  [2] J. Lindsey, A. Templeton, et al., “Tracing the Thoughts of a Large Language Model,” Anthropic, 2025
  [3] T. Bricken, A. Templeton, J. Batson, et al., “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning,” Anthropic Transformer Circuits Thread, 2023
  [4] A. Templeton, T. Conerly, J. Marcus, et al., “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet,” Anthropic Transformer Circuits Thread, 2024
  [5] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, “Sparse Autoencoders Find Highly Interpretable Features in Language Models,” International Conference on Learning Representations (ICLR), 2024
  [6] T. Lieberum, S. Rajamanoharan, A. Conmy, et al., “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2,” arXiv preprint arXiv:2408.05147, 2024
  [7] B. A. Olshausen and D. J. Field, “Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996
  [8] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, “Zoom In: An Introduction to Circuits,” Distill, 2020
  [9] N. Elhage, T. Hume, C. Olsson, et al., “Toy Models of Superposition,” Anthropic Transformer Circuits Thread, 2022
  [10] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski, “Linear Algebraic Structure of Word Senses, with Applications to Polysemy,” Transactions of the Association for Computational Linguistics (TACL), vol. 6, pp. 483–495, 2018
  [11] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Dictionary Learning for Sparse Coding,” International Conference on Machine Learning (ICML), 2009
  [12] J. Dunefsky, P. Chlenski, and N. Nanda, “Transcoders Find Interpretable LLM Feature Circuits,” arXiv preprint arXiv:2406.11944, 2024
  [13] J. Lindsey, A. Templeton, J. Marcus, T. Conerly, J. Batson, and C. Olah, “Sparse Crosscoders for Cross-Layer Features and Model Diffing,” Anthropic Transformer Circuits Thread, 2024
  [14] K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt, “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small,” International Conference on Learning Representations (ICLR), 2023
  [15] A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, “Towards Automated Circuit Discovery for Mechanistic Interpretability,” Advances in Neural Information Processing Systems (NeurIPS), 2023
  [16] M. Hanna, O. Liu, and A. Variengien, “How Does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-Trained Language Model,” Advances in Neural Information Processing Systems (NeurIPS), 2023
  [17] A. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora, “Localizing Model Behavior with Path Patching,” arXiv preprint arXiv:2304.05969, 2023
  [18] S. Marks, C. Rager, E. Michaud, et al., “Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models,” arXiv preprint arXiv:2403.19647, 2024
  [19] A. Geiger, H. Lu, T. Icard, and C. Potts, “Causal Abstractions of Neural Networks,” Advances in Neural Information Processing Systems (NeurIPS), 2021
  [20] J. Vig, S. Gehrmann, Y. Belinkov, et al., “Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias,” arXiv preprint arXiv:2004.12265, 2020
  [21] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and Editing Factual Associations in GPT,” Advances in Neural Information Processing Systems (NeurIPS), 2022
  [22] J. Bloom, “Open Source Sparse Autoencoders for All Residual Stream Layers of GPT2-Small,” LessWrong / AI Alignment Forum, 2024
  [23] J. Bloom, A. Kissane, et al., “SAELens: Training, Analyzing, and Visualizing Sparse Autoencoders,” https://github.com/jbloomAus/SAELens, 2024
  [24] J. Lin, et al., “Neuronpedia: An Open Platform for Mechanistic Interpretability Research,” https://www.neuronpedia.org, 2024
  [25] N. Nanda and J. Bloom, “TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models,” https://github.com/TransformerLensOrg/TransformerLens, 2022
  [26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI Technical Report, 2019
  [27] B. L. Welch, “The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved,” Biometrika, vol. 34, no. 1-2, pp. 28–35, 1947
  [28] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Routledge, 1988
  [29] S. Holm, “A Simple Sequentially Rejective Multiple Test Procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979
  [30] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011
  [31] T. Fawcett, “An Introduction to ROC Analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006
  [32] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep Reinforcement Learning that Matters,” AAAI Conference on Artificial Intelligence, 2018
  [33] J. Pineau, P. Vincent-Lamarre, K. Sinha, et al., “Improving Reproducibility in Machine Learning Research,” Journal of Machine Learning Research, vol. 22, no. 164, pp. 1–20, 2021
  [34] M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design,” International Conference on Learning Representations (ICLR), 2024
  [35] M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of What Art? A Call for Multi-Prompt LLM Evaluation,” Transactions of the Association for Computational Linguistics (TACL), 2024
  [36] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,” Annual Meeting of the Association for Computational Linguistics (ACL), 2022
  [37] X. Liu, H. Yu, H. Zhang, et al., “AgentBench: Evaluating LLMs as Agents,” International Conference on Learning Representations (ICLR), 2024
  [38] C. E. Jimenez, J. Yang, A. Wettig, et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” International Conference on Learning Representations (ICLR), 2024
  [39] S. Zhou, F. F. Xu, H. Zhu, et al., “WebArena: A Realistic Web Environment for Building Autonomous Agents,” International Conference on Learning Representations (ICLR), 2024
  [40] M. Nasermoghadasi, “What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema,” IEEE International Conference on Big Data, 2026 (companion submission)