Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
Pith reviewed 2026-05-22 07:39 UTC · model grok-4.3
The pith
A sparse autoencoder feature in GPT-2 small activations correlates strongly with failures on indirect object identification for prompts using 'the keys' as the object.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that sparse autoencoder features can serve as readable indicators of task failure in language models performing indirect object identification. One feature shows a large positive effect size on failed trials (d = +2.93) and is nearly inactive except on prompts whose object is 'the keys,' where the failure rate is 93.3% versus 7.5% on the other seven objects. An ablation experiment indicates that the feature is a correlate rather than a sufficient cause, and a prediction baseline indicates that the sparse representation offers interpretability without better predictive accuracy than the full residual stream. The audit pipeline itself, which is efficient and model-agnostic, is the paper's stated methodological contribution.
What carries the argument
The central mechanism is the statistical comparison of sparse autoencoder feature activations across failed and successful trials, using metrics such as Cohen's d for effect size and Fisher exact tests for association with specific lexical items.
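The effect-size half of that comparison can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the pooled-standard-deviation form of Cohen's d (the source does not say which variant is used), and the function name and toy activations are hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def cohens_d(failed, succeeded):
    """Cohen's d (failed minus succeeded) using a pooled standard deviation.

    Assumption: the pooled-variance form; the source does not specify the variant.
    """
    n1, n2 = len(failed), len(succeeded)
    diff = mean(failed) - mean(succeeded)
    pooled_var = (
        (n1 - 1) * variance(failed) + (n2 - 1) * variance(succeeded)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        # Degenerate case: no spread in either group.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / sqrt(pooled_var)


# Hypothetical activations of one SAE feature on failed vs. successful trials.
failed_acts = [2.0, 3.0, 4.0]
succeeded_acts = [0.0, 1.0, 2.0]

d = cohens_d(failed_acts, succeeded_acts)
is_large = abs(d) > 0.8  # the "large effect" threshold quoted in the abstract
```

A positive d means the feature fires more strongly on failed trials, which is the sign convention the abstract uses for feature 17,491.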
If this is right
- If the audit method generalizes, similar sparse features could be identified for failures in other language model tasks.
- The finding that certain features link to specific object names suggests models may have localized sensitivities to particular words or concepts that cause systematic errors.
- Since ablating the feature does not improve performance, the failure mechanism likely involves interactions across multiple features or layers.
- The equivalence in predictive power between SAE features and raw activations implies that interpretability gains come at little cost to accuracy in failure prediction.
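The ablation referred to in the third point can be pictured as removing one feature's contribution from each residual-stream vector. The sketch below is an assumption about how such a zero-ablation is commonly done (subtract the feature's activation times its decoder direction); the source only says the feature was zeroed "across all token positions" and does not describe the implementation. All names and values are hypothetical.

```python
def ablate_feature(residual, activation, decoder_direction):
    """Remove one feature's contribution from a residual vector: x' = x - a_f * d_f."""
    return [x - activation * d for x, d in zip(residual, decoder_direction)]


def ablate_all_positions(residuals, activations, decoder_direction):
    """Apply the same ablation at every token position of a prompt."""
    return [
        ablate_feature(r, a, decoder_direction)
        for r, a in zip(residuals, activations)
    ]


# Toy 3-dimensional example (the real residual stream is 768-dimensional).
decoder_direction = [0.5, 0.0, -1.0]          # hypothetical decoder row for the feature
residuals = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]  # one vector per token position
activations = [2.0, 0.0]                       # feature activation per position

ablated = ablate_all_positions(residuals, activations, decoder_direction)
```

Positions where the feature is inactive (activation 0) are left unchanged, so the intervention only touches tokens where the feature actually fires.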
Where Pith is reading between the lines
- The pipeline could be applied to other models or tasks to discover if lexical triggers for failure are common in language models.
- Extending the audit to earlier or later layers might reveal where the decision to fail is made in the network.
- Testing whether retraining or fine-tuning on balanced 'keys' examples reduces the failure rate would check if this is a data artifact.
- Combining this activation audit with causal interventions in other parts of the model could help isolate the actual cause of the error.
Load-bearing premise
The differences in sparse autoencoder activations observed on this particular collection of prompts reflect meaningful aspects of the model's general processing of the indirect object identification task rather than being tied only to the specific wording or statistics of those prompts.
What would settle it
Applying the same feature audit to a fresh set of prompts with varied objects or to a different model size and observing no features with comparably large effect sizes or selective activation on high-failure subsets would indicate that the correlates are not robust.
Original abstract
We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen's d| > 0.8). The strongest single correlate of failure -- feature 17,491, d=+2.93, Neuronpedia label 'cryptographic keys' -- is essentially silent except when the prompt's transferred object is 'the keys,' on which GPT-2 small fails 93.3% of the time vs. 7.5% on the other seven objects (Fisher exact p = 8.79 x 10^-33). We put this correlate through three controls that a mechanistic claim should pass. (i) A causal ablation: zeroing feature 17,491 in the residual stream across all token positions of the 45 keys prompts does not restore accuracy (6.7% -> 4.4%); the feature is a correlate, not a sufficient cause at this layer. (ii) A representation baseline: a logistic regression on the raw 768-dimensional residual stream reaches 5-fold ROC AUC = 0.929, matching the top-100 SAE features (0.927); the SAE basis adds interpretability, not predictive power. (iii) A seed-robustness check: across five random seeds the keys-subset failure rate stays in 75.0--93.3% (the behavioural effect is real), but feature 17,491 is the top-|d| feature in only 1 of 5 runs. The methodological contribution is therefore the audit pipeline (cheap, model-agnostic, surfaces named correlates) rather than any single feature found through it. We release the code, the 300-prompt corpus, the 300x24,576 activation matrix, the ablation and baseline scripts, and the figures. The full pipeline runs on a laptop (Apple M3 Max, no discrete GPU).
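The headline percentages in the abstract can be cross-checked with a small calculation. The 2x2 counts below are an assumption reconstructed from the reported rates (45 keys prompts at 93.3% failure, 255 other prompts at 7.5% failure); the abstract does not list raw counts. The Fisher test is a plain hypergeometric implementation written for this sketch, not the authors' script.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Assumed counts (rows: keys / other objects; columns: failed / succeeded).
keys_fail, keys_ok = 42, 3
other_fail, other_ok = 19, 236

keys_failure_rate = keys_fail / (keys_fail + keys_ok)
other_failure_rate = other_fail / (other_fail + other_ok)
overall_accuracy = (keys_ok + other_ok) / 300

p_value = fisher_exact_two_sided(keys_fail, keys_ok, other_fail, other_ok)
```

Under these assumed counts the three rates round to the figures quoted in the abstract (93.3%, 7.5%, and 79.7%), and the resulting p-value can be compared against the reported 8.79 x 10^-33.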
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a narrow empirical audit of 24,576 SAE features from the layer-8 residual stream of GPT-2 small on a fixed 300-prompt IOI corpus. GPT-2 small achieves 79.7% accuracy; 146 features meet a Holm-corrected significance threshold and 105 show large effect sizes. Feature 17,491 (Neuronpedia label 'cryptographic keys') exhibits the largest effect (d = +2.93) and is active almost exclusively on the 45 'keys' prompts, where accuracy drops to 6.7% (Fisher exact p = 8.79e-33). Three controls are presented: zero-ablation of the feature does not restore performance, a logistic regression on the raw 768-dimensional residual stream matches the predictive power of the top-100 SAE features (ROC AUC 0.929 vs 0.927), and the behavioral failure rate on 'keys' is stable across five seeds while the identity of the top feature is not. The stated contribution is the open audit pipeline and released artifacts rather than any general mechanistic claim.
Significance. If the reported correlations and controls hold, the work supplies a concrete, low-cost template for auditing task failures with named SAE features together with explicit statistical thresholds, a failed causal intervention, and a matched representation baseline. The full release of the 300-prompt corpus, 300-by-24,576 activation matrix, ablation scripts, and figures is a clear strength that directly supports reproducibility and extension by other researchers.
minor comments (3)
- [Abstract] Abstract and §3: the citation 'Bloom (2024)' for the SAE release should be expanded to a full bibliographic entry in the references section.
- [Methods] §4.2: the exact construction of the 300-prompt corpus (sampling of the eight objects, template variations) is described only at high level; a short appendix table listing the object set and prompt template would improve reproducibility.
- [Results] Table 1 or equivalent: the reported ROC AUC values (0.929 vs 0.927) are numerically close; adding a brief note on whether a paired test was considered would clarify that the SAE basis does not add predictive power beyond the raw residual stream.
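For readers weighing the last comment, ROC AUC has a simple rank interpretation: the probability that a randomly chosen failed trial receives a higher score than a randomly chosen successful one. The sketch below computes it by direct pairwise comparison; the function name and toy scores are hypothetical, and this is not the paper's scikit-learn pipeline.

```python
def roc_auc(labels, scores):
    """ROC AUC as P(score of a positive > score of a negative), ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: label 1 = failed trial, score = predicted failure probability.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]

auc = roc_auc(labels, scores)
```

Computing this per cross-validation fold for both feature sets would give paired per-fold AUC values, which is the kind of input a paired comparison of 0.929 versus 0.927 would need.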
Simulated Author's Rebuttal
We thank the referee for their summary of the manuscript and for recommending minor revision. We appreciate the recognition of the work's reproducibility strengths, including the full release of the 300-prompt corpus, activation matrix, ablation scripts, and figures. No specific major comments were raised in the report.
Circularity Check
No significant circularity
The paper conducts a purely empirical audit: it measures SAE activations on a fixed 300-prompt IOI corpus, applies Holm-corrected significance tests and Cohen's d, runs explicit ablation and logistic-regression baselines, and performs a 5-seed robustness check. All reported results are direct statistical comparisons or experimental outcomes on the released activation matrix; no equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The SAE artifact is cited from external work (Bloom 2024) solely as a data source, not as justification for any uniqueness theorem or ansatz. The stated contribution is the reproducible pipeline and artifact release, which rests on the documented experiments rather than any self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Holm correction is appropriate for controlling the family-wise error rate across 24,576 simultaneous feature tests.
- [domain assumption] A Cohen's d threshold of 0.8 validly separates large from smaller effect sizes in activation differences.
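The first axiom refers to Holm's step-down procedure, which can be sketched as follows. The significance level alpha = 0.05 is an assumption (the source does not state the level used), and the p-values are toy numbers, not results from the paper.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per p-value: True where Holm's step-down procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Toy p-values for four hypothetical features.
p_values = [0.001, 0.04, 0.03, 0.012]
rejected = holm_reject(p_values)  # [True, False, False, True]
```

With 24,576 features the smallest p-value faces a threshold of alpha / 24,576, which is why only 146 features survive in the reported audit.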
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction: reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] J. Lindsey, A. Templeton, et al., “On the Biology of a Large Language Model,” Anthropic Technical Report, 2025
- [2] J. Lindsey, A. Templeton, et al., “Tracing the Thoughts of a Large Language Model,” Anthropic, 2025
- [3] T. Bricken, A. Templeton, J. Batson, et al., “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning,” Anthropic Transformer Circuits Thread, 2023
- [4] A. Templeton, T. Conerly, J. Marcus, et al., “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet,” Anthropic Transformer Circuits Thread, 2024
- [5] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, “Sparse Autoencoders Find Highly Interpretable Features in Language Models,” International Conference on Learning Representations (ICLR), 2024
- [6] T. Lieberum, S. Rajamanoharan, A. Conmy, et al., “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2,” arXiv preprint arXiv:2408.05147, 2024
- [7] B. A. Olshausen and D. J. Field, “Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996
- [8] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, “Zoom In: An Introduction to Circuits,” Distill, 2020
- [9] N. Elhage, T. Hume, C. Olsson, et al., “Toy Models of Superposition,” Anthropic Transformer Circuits Thread, 2022
- [10] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski, “Linear Algebraic Structure of Word Senses, with Applications to Polysemy,” Transactions of the Association for Computational Linguistics (TACL), vol. 6, pp. 483–495, 2018
- [11] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Dictionary Learning for Sparse Coding,” International Conference on Machine Learning (ICML), 2009
- [12] J. Dunefsky, P. Chlenski, and N. Nanda, “Transcoders Find Interpretable LLM Feature Circuits,” arXiv preprint arXiv:2406.11944, 2024
- [13] J. Lindsey, A. Templeton, J. Marcus, T. Conerly, J. Batson, and C. Olah, “Sparse Crosscoders for Cross-Layer Features and Model Diffing,” Anthropic Transformer Circuits Thread, 2024
- [14] K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt, “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small,” International Conference on Learning Representations (ICLR), 2023
- [15] A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, “Towards Automated Circuit Discovery for Mechanistic Interpretability,” Advances in Neural Information Processing Systems (NeurIPS), 2023
- [16] M. Hanna, O. Liu, and A. Variengien, “How Does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-Trained Language Model,” Advances in Neural Information Processing Systems (NeurIPS), 2023
- [17] A. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora, “Localizing Model Behavior with Path Patching,” arXiv preprint arXiv:2304.05969, 2023
- [18] S. Marks, C. Rager, E. Michaud, et al., “Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models,” arXiv preprint arXiv:2403.19647, 2024
- [19] A. Geiger, H. Lu, T. Icard, and C. Potts, “Causal Abstractions of Neural Networks,” Advances in Neural Information Processing Systems (NeurIPS), 2021
- [20] J. Vig, S. Gehrmann, Y. Belinkov, et al., “Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias,” arXiv preprint arXiv:2004.12265, 2020
- [21] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and Editing Factual Associations in GPT,” Advances in Neural Information Processing Systems (NeurIPS), 2022
- [22] J. Bloom, “Open Source Sparse Autoencoders for All Residual Stream Layers of GPT2-Small,” LessWrong / AI Alignment Forum, 2024
- [23] J. Bloom, A. Kissane, et al., “SAELens: Training, Analyzing, and Visualizing Sparse Autoencoders,” https://github.com/jbloomAus/SAELens, 2024
- [24] J. Lin, et al., “Neuronpedia: An Open Platform for Mechanistic Interpretability Research,” https://www.neuronpedia.org, 2024
- [25] N. Nanda and J. Bloom, “TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models,” https://github.com/TransformerLensOrg/TransformerLens, 2022
- [26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI Technical Report, 2019
- [27] B. L. Welch, “The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved,” Biometrika, vol. 34, no. 1-2, pp. 28–35, 1947
- [28] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Routledge, 1988
- [29] S. Holm, “A Simple Sequentially Rejective Multiple Test Procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979
- [30] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011
- [31] T. Fawcett, “An Introduction to ROC Analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006
- [32] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep Reinforcement Learning that Matters,” AAAI Conference on Artificial Intelligence, 2018
- [33] J. Pineau, P. Vincent-Lamarre, K. Sinha, et al., “Improving Reproducibility in Machine Learning Research,” Journal of Machine Learning Research, vol. 22, no. 164, pp. 1–20, 2021
- [34] M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design,” International Conference on Learning Representations (ICLR), 2024
- [35] M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of What Art? A Call for Multi-Prompt LLM Evaluation,” Transactions of the Association for Computational Linguistics (TACL), 2024
- [36] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,” Annual Meeting of the Association for Computational Linguistics (ACL), 2022
- [37] X. Liu, H. Yu, H. Zhang, et al., “AgentBench: Evaluating LLMs as Agents,” International Conference on Learning Representations (ICLR), 2024
- [38] C. E. Jimenez, J. Yang, A. Wettig, et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” International Conference on Learning Representations (ICLR), 2024
- [39] S. Zhou, F. F. Xu, H. Zhu, et al., “WebArena: A Realistic Web Environment for Building Autonomous Agents,” International Conference on Learning Representations (ICLR), 2024
- [40] M. Nasermoghadasi, “What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema,” IEEE International Conference on Big Data, 2026 (companion submission)