Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

Mahdi Nasermoghadasi

arxiv: 2605.22719 · v1 · pith:VBLHXE6Pnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 07:39 UTC · model grok-4.3

keywords: sparse autoencoders · GPT-2 small · indirect object identification · task failure · activation differences · feature auditing · mechanistic analysis

The pith

A sparse autoencoder feature in GPT-2 small activations correlates strongly with failures on indirect object identification for prompts using 'the keys' as the object.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines differences in sparse autoencoder features from GPT-2 small's activations during successful and failed attempts at the indirect object identification task. Using 300 prompts where the model achieves about 80 percent accuracy, it identifies numerous features that activate differently on failure cases. The most prominent one activates almost exclusively when the transferred object is 'the keys,' a scenario where the model fails over 90 percent of the time compared to under 8 percent otherwise. Through controls including ablating the feature, comparing to raw activations, and checking across random seeds, the work shows this is a reliable behavioral pattern but the specific feature is not the sole cause. The primary advance lies in providing an accessible method to surface such interpretable correlates of model errors.

Core claim

The paper establishes that sparse autoencoder features can serve as readable indicators of task failure in language models performing indirect object identification. Specifically, one feature shows a large positive effect size on failure trials and is nearly inactive except on prompts where the object is 'the keys,' leading to a dramatically higher failure rate on those items. Ablation experiments confirm the feature is a correlate rather than a sufficient cause, while prediction baselines indicate that the sparse representation offers interpretability without superior predictive accuracy over the full residual stream. The audit pipeline itself, which is efficient and model-agnostic, is the primary contribution.

What carries the argument

The central mechanism is the statistical comparison of sparse autoencoder feature activations across failed and successful trials, using metrics such as Cohen's d for effect size and Fisher exact tests for association with specific lexical items.
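As a rough illustration of the effect-size half of this comparison, the sketch below computes Cohen's d for one feature from two lists of activations. The pooled-standard-deviation form is an assumption (the page does not say which variant the paper uses), and the function name and toy activations are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(fail_acts, success_acts):
    """Cohen's d with a pooled standard deviation (assumed convention).

    Positive values mean the feature is more active on failed trials.
    """
    n1, n2 = len(fail_acts), len(success_acts)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(fail_acts)
                  + (n2 - 1) * variance(success_acts)) / (n1 + n2 - 2)
    if pooled_var == 0:
        return 0.0  # no spread at all: treat the effect size as undefined/zero
    return (mean(fail_acts) - mean(success_acts)) / sqrt(pooled_var)


# Toy activations for a single feature (illustrative values, not from the paper).
fail = [2.0, 2.5, 3.0, 2.2]
success = [0.0, 0.1, 0.0, 0.3, 0.0]
d = cohens_d(fail, success)
is_large = abs(d) > 0.8  # the "large effect" threshold quoted in the abstract
```

In an audit of this kind the same function would be applied column by column to the activation matrix, once per feature.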

If this is right

If the audit method generalizes, similar sparse features could be identified for failures in other language model tasks.
The finding that certain features link to specific object names suggests models may have localized sensitivities to particular words or concepts that cause systematic errors.
Since ablating the feature does not improve performance, the failure mechanism likely involves interactions across multiple features or layers.
The equivalence in predictive power between SAE features and raw activations implies that interpretability gains come at little cost to accuracy in failure prediction.
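The ablation mentioned in this list can be pictured as removing one feature's contribution from the residual stream at every token position. The sketch below assumes the common formulation in which a feature's contribution is its activation times its decoder direction; the page does not give the paper's exact implementation, and all names and numbers here are hypothetical.

```python
def ablate_feature(resid, feature_act, decoder_dir):
    """Subtract one SAE feature's contribution from a residual-stream vector.

    Assumes the contribution is feature_act * decoder_dir, so the rest of the
    residual stream (including SAE reconstruction error) is left untouched.
    """
    return [x - feature_act * w for x, w in zip(resid, decoder_dir)]


def ablate_all_positions(resid_per_token, acts_per_token, decoder_dir):
    """Apply the same ablation at every token position of one prompt."""
    return [ablate_feature(r, a, decoder_dir)
            for r, a in zip(resid_per_token, acts_per_token)]


# Toy 4-dimensional residual stream over two token positions (illustrative).
resid_per_token = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
acts_per_token = [2.0, 0.0]          # the feature fires only on the first token
decoder_dir = [1.0, 0.0, 0.5, 0.0]   # hypothetical decoder direction
ablated = ablate_all_positions(resid_per_token, acts_per_token, decoder_dir)
```

Positions where the feature is inactive are returned unchanged, which is why an ablation of a nearly silent feature can only matter on the prompts where it fires.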

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pipeline could be applied to other models or tasks to discover if lexical triggers for failure are common in language models.
Extending the audit to earlier or later layers might reveal where the decision to fail is made in the network.
Testing whether retraining or fine-tuning on balanced 'keys' examples reduces the failure rate would check if this is a data artifact.
Combining this activation audit with causal interventions in other parts of the model could help isolate the actual cause of the error.

Load-bearing premise

The differences in sparse autoencoder activations observed on this particular collection of prompts reflect meaningful aspects of the model's general processing of the indirect object identification task rather than being tied only to the specific wording or statistics of those prompts.

What would settle it

Applying the same feature audit to a fresh set of prompts with varied objects or to a different model size and observing no features with comparably large effect sizes or selective activation on high-failure subsets would indicate that the correlates are not robust.

Figures

Figures reproduced from arXiv: 2605.22719 by Mahdi Nasermoghadasi.

**Figure 2.** Volcano plot of all 24,576 SAE features at layer 8.

**Figure 4.** IOI failure rate by transferred-object choice. Seven of …

**Figure 5.** Causal ablation. Zeroing feature 17,491 across all token …

**Figure 6.** Failure-prediction AUC under four feature representa…

**Figure 8.** ROC curve for predicting IOI failure from feature …

Original abstract

We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen's d| > 0.8). The strongest single correlate of failure -- feature 17,491, d=+2.93, Neuronpedia label 'cryptographic keys' -- is essentially silent except when the prompt's transferred object is 'the keys,' on which GPT-2 small fails 93.3% of the time vs. 7.5% on the other seven objects (Fisher exact p = 8.79 x 10^-33). We put this correlate through three controls that a mechanistic claim should pass. (i) A causal ablation: zeroing feature 17,491 in the residual stream across all token positions of the 45 keys prompts does not restore accuracy (6.7% -> 4.4%); the feature is a correlate, not a sufficient cause at this layer. (ii) A representation baseline: a logistic regression on the raw 768-dimensional residual stream reaches 5-fold ROC AUC = 0.929, matching the top-100 SAE features (0.927); the SAE basis adds interpretability, not predictive power. (iii) A seed-robustness check: across five random seeds the keys-subset failure rate stays in 75.0--93.3% (the behavioural effect is real), but feature 17,491 is the top-|d| feature in only 1 of 5 runs. The methodological contribution is therefore the audit pipeline (cheap, model-agnostic, surfaces named correlates) rather than any single feature found through it. We release the code, the 300-prompt corpus, the 300x24,576 activation matrix, the ablation and baseline scripts, and the figures. The full pipeline runs on a laptop (Apple M3 Max, no discrete GPU).
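The headline percentages can be cross-checked with a little arithmetic. The sketch below reconstructs a 2x2 contingency table from the figures in the abstract and runs a two-sided Fisher exact test on it using only the standard library. The cell counts (42 of 45 keys prompts failing, 19 of 255 other prompts failing) are inferred from the reported percentages and are an assumption, not numbers quoted by the paper; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at most as probable as the observed one.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Counts inferred from the abstract's percentages (assumption).
keys_fail, keys_ok = 42, 3        # 45 'the keys' prompts
other_fail, other_ok = 19, 236    # 255 prompts with the other seven objects

keys_fail_rate = round(100 * keys_fail / (keys_fail + keys_ok), 1)
other_fail_rate = round(100 * other_fail / (other_fail + other_ok), 1)
accuracy = round(100 * (keys_ok + other_ok) / 300, 1)
p_value = fisher_exact_two_sided(keys_fail, keys_ok, other_fail, other_ok)
```

If the inferred counts are right, the three rounded rates reproduce the 93.3%, 7.5%, and 79.7% in the abstract, and `p_value` can be compared against the reported Fisher exact p.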

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A narrow but careful empirical audit of SAE features on IOI failures in GPT-2 small, with full artifact release and appropriate controls.

This paper runs a small, reproducible audit of SAE features in GPT-2 small on the indirect object identification task. On 300 prompts it finds 146 features that differ significantly between successes and failures. The standout is a feature labeled 'cryptographic keys' that activates almost only on prompts whose transferred object is 'the keys,' which are also the prompts on which the model fails badly.

What stands out is how carefully the author checks the findings. A causal ablation shows that zeroing the feature doesn't fix the errors; a simple logistic regression on the raw residual stream performs about the same as the top SAE features; and the behavioral failure rate holds across seeds even though the specific feature doesn't. The author also releases the full code, prompts, activation data, and scripts so others can reproduce or extend the work.

The soft spots are mostly about scope. The prompt set is narrow, focused on eight specific objects, so the keys correlate could be an artifact of that particular lexical choice rather than something deeper. The feature is the top one in only one of five random seeds, which the paper reports but which means individual features may not be reliable signals. The paper does not claim to explain the mechanism, only to surface correlates.

This is useful for researchers doing interpretability work who need examples of how to audit failures with SAEs in a controlled way. It won't change how most people think about IOI or SAEs broadly, but the pipeline and data release could save time for someone trying similar things. I would send this to peer review. The empirical work is solid for what it is, the limitations are stated clearly, and the artifacts add real value.

Referee Report

0 major / 3 minor

Summary. The manuscript reports a narrow empirical audit of 24,576 SAE features from the layer-8 residual stream of GPT-2 small on a fixed 300-prompt IOI corpus. GPT-2 small achieves 79.7% accuracy; 146 features meet a Holm-corrected significance threshold and 105 show large effect sizes. Feature 17,491 (Neuronpedia label 'cryptographic keys') exhibits the largest effect (d = +2.93) and is active almost exclusively on the 45 'keys' prompts, where accuracy drops to 6.7% (Fisher exact p = 8.79e-33). Three controls are presented: zero-ablation of the feature does not restore performance, a logistic regression on the raw 768-dimensional residual stream matches the predictive power of the top-100 SAE features (ROC AUC 0.929 vs 0.927), and the behavioral failure rate on 'keys' is stable across five seeds while the identity of the top feature is not. The stated contribution is the open audit pipeline and released artifacts rather than any general mechanistic claim.

Significance. If the reported correlations and controls hold, the work supplies a concrete, low-cost template for auditing task failures with named SAE features together with explicit statistical thresholds, a failed causal intervention, and a matched representation baseline. The full release of the 300-prompt corpus, 300-by-24,576 activation matrix, ablation scripts, and figures is a clear strength that directly supports reproducibility and extension by other researchers.

minor comments (3)

[Abstract] Abstract and §3: the citation 'Bloom (2024)' for the SAE release should be expanded to a full bibliographic entry in the references section.
[Methods] §4.2: the exact construction of the 300-prompt corpus (sampling of the eight objects, template variations) is described only at high level; a short appendix table listing the object set and prompt template would improve reproducibility.
[Results] Table 1 or equivalent: the reported ROC AUC values (0.929 vs 0.927) are numerically close; adding a brief note on whether a paired test was considered would clarify that the SAE basis does not add predictive power beyond the raw residual stream.
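For readers weighing the AUC comparison raised in the last comment, it may help to recall what the number measures: the probability that a randomly chosen failed trial receives a higher predicted score than a randomly chosen successful one. The sketch below computes that quantity directly from pairwise comparisons. It is a plain-Python illustration with hypothetical names and toy scores, not the paper's evaluation code, and it omits the 5-fold cross-validation the abstract describes.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels use 1 for a failed trial and 0 for a successful one; ties count half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predicted failure probabilities (illustrative values only).
scores = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
auc = roc_auc(scores, labels)  # 3 of the 4 pairs are ordered correctly
```

Because the statistic depends only on the ranking of trials, two representations can reach nearly identical AUCs while ordering individual prompts differently, which is one reason a paired comparison on the same folds is informative.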

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for recommending minor revision. We appreciate the recognition of the work's reproducibility strengths, including the full release of the 300-prompt corpus, activation matrix, ablation scripts, and figures. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper conducts a purely empirical audit: it measures SAE activations on a fixed 300-prompt IOI corpus, applies Holm-corrected significance tests and Cohen's d, runs explicit ablation and logistic-regression baselines, and performs a 5-seed robustness check. All reported results are direct statistical comparisons or experimental outcomes on the released activation matrix; no equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The SAE artifact is cited from external work (Bloom 2024) solely as a data source, not as justification for any uniqueness theorem or ansatz. The stated contribution is the reproducible pipeline and artifact release, which rests on the documented experiments rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard statistical assumptions from prior SAE releases and conventional hypothesis testing; no new free parameters, axioms beyond standard math, or invented entities are introduced in the reported audit.

axioms (2)

standard math: Holm correction is appropriate for controlling family-wise error rate across 24,576 simultaneous feature tests. Applied to identify the 146 significant features.
domain assumption: Cohen's d threshold of 0.8 validly separates large from smaller effect sizes in activation differences. Used to filter the 105 large-effect features.
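The Holm correction named in the first axiom is a step-down procedure: p-values are sorted, and the k-th smallest is compared against alpha divided by the number of hypotheses still in play. A minimal sketch follows; the significance level of 0.05 is an assumption (the page does not state the level used), and the function name and toy p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject/keep flag per hypothesis.

    alpha=0.05 is an assumed family-wise error rate, not taken from the paper.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value faces alpha/m, the next alpha/(m-1), and so on.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Toy example with four hypotheses (illustrative p-values).
flags = holm_reject([0.01, 0.04, 0.03, 0.005])

# With 24,576 features, the strictest threshold under the assumed alpha:
first_threshold = 0.05 / 24576
```

With 24,576 simultaneous tests the smallest p-value has to clear a threshold of roughly 2e-6 under this assumed alpha, which gives a sense of how strong an effect must be to count among the significant features.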

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

[1] J. Lindsey, A. Templeton, et al., “On the Biology of a Large Language Model,” Anthropic Technical Report, 2025

[2] J. Lindsey, A. Templeton, et al., “Tracing the Thoughts of a Large Language Model,” Anthropic, 2025

[3] T. Bricken, A. Templeton, J. Batson, et al., “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning,” Anthropic Transformer Circuits Thread, 2023

[4] A. Templeton, T. Conerly, J. Marcus, et al., “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet,” Anthropic Transformer Circuits Thread, 2024

[5] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, “Sparse Autoencoders Find Highly Interpretable Features in Language Models,” International Conference on Learning Representations (ICLR), 2024

[6] T. Lieberum, S. Rajamanoharan, A. Conmy, et al., “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2,” arXiv preprint arXiv:2408.05147, 2024

[7] B. A. Olshausen and D. J. Field, “Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996

[8] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, “Zoom In: An Introduction to Circuits,” Distill, 2020

[9] N. Elhage, T. Hume, C. Olsson, et al., “Toy Models of Superposition,” Anthropic Transformer Circuits Thread, 2022

[10] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski, “Linear Algebraic Structure of Word Senses, with Applications to Polysemy,” Transactions of the Association for Computational Linguistics (TACL), vol. 6, pp. 483–495, 2018

[11] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Dictionary Learning for Sparse Coding,” International Conference on Machine Learning (ICML), 2009

[12] J. Dunefsky, P. Chlenski, and N. Nanda, “Transcoders Find Interpretable LLM Feature Circuits,” arXiv preprint arXiv:2406.11944, 2024

[13] J. Lindsey, A. Templeton, J. Marcus, T. Conerly, J. Batson, and C. Olah, “Sparse Crosscoders for Cross-Layer Features and Model Diffing,” Anthropic Transformer Circuits Thread, 2024

[14] K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt, “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small,” International Conference on Learning Representations (ICLR), 2023

[15] A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, “Towards Automated Circuit Discovery for Mechanistic Interpretability,” Advances in Neural Information Processing Systems (NeurIPS), 2023

[16] M. Hanna, O. Liu, and A. Variengien, “How Does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-Trained Language Model,” Advances in Neural Information Processing Systems (NeurIPS), 2023

[17] A. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora, “Localizing Model Behavior with Path Patching,” arXiv preprint arXiv:2304.05969, 2023

[18] S. Marks, C. Rager, E. Michaud, et al., “Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models,” arXiv preprint arXiv:2403.19647, 2024

[19] A. Geiger, H. Lu, T. Icard, and C. Potts, “Causal Abstractions of Neural Networks,” Advances in Neural Information Processing Systems (NeurIPS), 2021

[20] J. Vig, S. Gehrmann, Y. Belinkov, et al., “Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias,” arXiv preprint arXiv:2004.12265, 2020

[21] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and Editing Factual Associations in GPT,” Advances in Neural Information Processing Systems (NeurIPS), 2022

[22] J. Bloom, “Open Source Sparse Autoencoders for All Residual Stream Layers of GPT2-Small,” LessWrong / AI Alignment Forum, 2024

[23] J. Bloom, A. Kissane, et al., “SAELens: Training, Analyzing, and Visualizing Sparse Autoencoders,” https://github.com/jbloomAus/SAELens, 2024

[24] J. Lin, et al., “Neuronpedia: An Open Platform for Mechanistic Interpretability Research,” https://www.neuronpedia.org, 2024

[25] N. Nanda and J. Bloom, “TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models,” https://github.com/TransformerLensOrg/TransformerLens, 2022

[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI Technical Report, 2019

[27] B. L. Welch, “The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved,” Biometrika, vol. 34, no. 1-2, pp. 28–35, 1947

[28] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Routledge, 1988

[29] S. Holm, “A Simple Sequentially Rejective Multiple Test Procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979

[30] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011

[31] T. Fawcett, “An Introduction to ROC Analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006

[32] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep Reinforcement Learning that Matters,” AAAI Conference on Artificial Intelligence, 2018

[33] J. Pineau, P. Vincent-Lamarre, K. Sinha, et al., “Improving Reproducibility in Machine Learning Research,” Journal of Machine Learning Research, vol. 22, no. 164, pp. 1–20, 2021

[34] M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design,” International Conference on Learning Representations (ICLR), 2024

[35] M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of What Art? A Call for Multi-Prompt LLM Evaluation,” Transactions of the Association for Computational Linguistics (TACL), 2024

[36] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,” Annual Meeting of the Association for Computational Linguistics (ACL), 2022

[37] X. Liu, H. Yu, H. Zhang, et al., “AgentBench: Evaluating LLMs as Agents,” International Conference on Learning Representations (ICLR), 2024

[38] C. E. Jimenez, J. Yang, A. Wettig, et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” International Conference on Learning Representations (ICLR), 2024

[39] S. Zhou, F. F. Xu, H. Zhu, et al., “WebArena: A Realistic Web Environment for Building Autonomous Agents,” International Conference on Learning Representations (ICLR), 2024

[40] M. Nasermoghadasi, “What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema,” IEEE International Conference on Big Data, 2026 (companion submission)
