Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Pith reviewed 2026-06-29 07:47 UTC · model grok-4.3
The pith
Sparse autoencoders extract up to 34 million interpretable features from a production-scale language model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal, respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code.
What carries the argument
Sparse autoencoders trained on middle-layer residual stream activations to produce a dictionary of features.
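As a rough illustration of the mechanism named above, the sketch below shows a generic sparse autoencoder forward pass with a textbook loss: squared reconstruction error plus an L1 penalty on the feature activations. The function names (`sae_forward`, `sae_loss`), the `l1_coeff` parameter, and the tiny hand-set weights are assumptions made for illustration. The paper's exact objective, normalization, and sizes are not given on this page and may differ.

```python
def relu(value):
    """Rectifier: keeps positive values, zeroes out the rest."""
    return value if value > 0.0 else 0.0


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector into feature activations, then decode it back."""
    # Encoder: one non-negative activation per dictionary feature.
    features = [
        relu(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W_enc, b_enc)
    ]
    # Decoder: the reconstruction is a weighted sum of feature directions.
    x_hat = list(b_dec)
    for activation, direction in zip(features, W_dec):
        for j, component in enumerate(direction):
            x_hat[j] += activation * component
    return features, x_hat


def sae_loss(x, features, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty (generic form)."""
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity_penalty = sum(abs(f) for f in features)
    return reconstruction_error + l1_coeff * sparsity_penalty


# Hypothetical toy setup: a 2-dimensional "residual stream" and 3 features.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_dec = [0.0, 0.0]

features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, reconstruction, l1_coeff=0.5)
print(features, reconstruction, loss)
```

In this toy case only two of the three features are active, which is the sparsity the L1 term rewards. A real dictionary would be many times wider than the activation vector.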
If this is right
- Features generalize to image inputs even though training used only text.
- Activation of harm-related features such as those for deception or bias alters model generations in the direction predicted by the feature interpretation.
- The same feature can respond to both concrete examples and abstract discussion of the same concept.
- Geometric and functional analyses of the learned features reveal additional regularities in how the model organizes its representations.
Where Pith is reading between the lines
- If the features prove faithful, the same method could be applied to additional layers to build a more complete map of model behavior.
- Similar dictionary learning could be tested on non-transformer architectures to check whether comparable monosemantic features appear.
- A direct test would be to train independent autoencoders on the same model activations and measure how consistently the same concepts are recovered.
- The ability to steer outputs via individual features suggests a route to targeted modification of model tendencies without retraining the entire system.
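The consistency test suggested in the list above (independent autoencoders trained on the same activations) could be scored roughly as follows. This is a minimal sketch under stated assumptions: each dictionary is represented as a list of decoder direction vectors, a feature counts as "recovered" when its best cosine match in the other dictionary exceeds a threshold, and the names `best_match_similarity`, `recovery_rate`, and the 0.9 threshold are all hypothetical choices rather than anything taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def best_match_similarity(dict_a, dict_b):
    """For each direction in dict_a, the highest cosine similarity to any direction in dict_b."""
    return [max(cosine(u, v) for v in dict_b) for u in dict_a]


def recovery_rate(dict_a, dict_b, threshold=0.9):
    """Fraction of dict_a features with a close counterpart in dict_b."""
    similarities = best_match_similarity(dict_a, dict_b)
    return sum(1 for s in similarities if s >= threshold) / len(similarities)


# Hypothetical decoder directions from two independent training runs.
run_a = [[1.0, 0.0], [0.0, 1.0]]
run_b = [[0.0, 2.0], [1.0, 1.0]]

print(best_match_similarity(run_a, run_b))
print(recovery_rate(run_a, run_b))
```

Matching on decoder directions is only one possible design. Comparing feature activations on a shared dataset is another, and the two can disagree when directions are similar but firing patterns differ.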
Load-bearing premise
The features recovered by the autoencoders correspond to the language model's actual internal computations rather than arising as side effects of how the autoencoders were trained.
What would settle it
An intervention experiment in which activating a feature labeled as representing deception produces no measurable increase in deceptive outputs on a held-out set of prompts.
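The shape of such an intervention experiment can be sketched as below. This is an assumption-laden illustration, not the paper's procedure: steering is modeled here as adding a scaled feature direction to a residual-stream vector (the paper's actual intervention may differ), and the per-prompt behavior scores are hypothetical numbers standing in for whatever held-out classifier or rater would be used. The names `steer` and `mean_paired_shift` are invented for this example.

```python
def steer(residual, feature_direction, scale):
    """Add a scaled feature direction to a residual-stream activation vector."""
    return [r + scale * d for r, d in zip(residual, feature_direction)]


def mean_paired_shift(baseline_scores, steered_scores):
    """Average per-prompt change in a behavior score after steering."""
    differences = [s - b for b, s in zip(baseline_scores, steered_scores)]
    return sum(differences) / len(differences)


# Hypothetical activation vector and feature direction.
steered_activation = steer([1.0, 2.0], [0.5, -1.0], scale=2.0)

# Hypothetical behavior scores on held-out prompts, before and after steering.
baseline = [0.1, 0.2, 0.0, 0.3]
steered = [0.1, 0.2, 0.0, 0.3]

shift = mean_paired_shift(baseline, steered)
print(steered_activation, shift)
```

A shift indistinguishable from zero on a sufficiently large held-out set would be the falsifying outcome described above. A real analysis would also need a significance test and a control direction, both omitted here.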
Original abstract
We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that sparse autoencoders (SAEs) with up to 34 million features can be trained on the middle-layer residual stream of Claude 3 Sonnet using scaling laws for hyperparameter selection, yielding features that are multilingual and multimodal (despite text-only training), respond to both concrete and abstract concepts, enable causal steering consistent with human interpretations, and include features for entities, sarcasm, code errors, and safety-relevant behaviors such as deception, power-seeking, sycophancy, and bias. It presents analyses of feature interpretability, geometry, and computational function while explicitly noting that the feature suite is incomplete and that rigorous methods for evaluating whether features faithfully capture model computations are lacking.
Significance. If the features are shown to be faithful to the model's computations, the work would be a significant advance in mechanistic interpretability by providing the first large-scale demonstration that dictionary learning scales to production frontier models, enabling systematic discovery of concepts and causal interventions on behaviors including those relevant to AI safety.
major comments (2)
- [Abstract] The central scaling claim (that SAEs extract interpretable features from a production-scale model and thereby address whether dictionary learning works beyond small transformers) requires that the discovered features reflect the model's actual internal computations rather than primarily reflecting SAE training dynamics (L1 penalty, reconstruction loss, or initialization). The abstract states that no rigorous evaluation methods exist for this faithfulness question, and the reported evidence (human inspection, steering results, multilingual/multimodal generalization) is consistent with but does not establish model-native features.
- [Abstract and limitations discussion] The explicit acknowledgment that the feature suite is incomplete and that faithfulness cannot be rigorously evaluated means the causal steering results and interpretability claims remain provisional; without a concrete test distinguishing model computations from SAE artifacts, the scaling demonstration does not yet fully resolve the open question posed in the abstract.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for emphasizing the distinction between SAE artifacts and model-native features. We respond to each major comment below.
Point-by-point responses
- Referee: [Abstract] The central scaling claim (that SAEs extract interpretable features from a production-scale model and thereby address whether dictionary learning works beyond small transformers) requires that the discovered features reflect the model's actual internal computations rather than primarily reflecting SAE training dynamics (L1 penalty, reconstruction loss, or initialization). The abstract states that no rigorous evaluation methods exist for this faithfulness question, and the reported evidence (human inspection, steering results, multilingual/multimodal generalization) is consistent with but does not establish model-native features.
  Authors: We agree that the evidence presented (human interpretability judgments, causal steering results, and cross-lingual/cross-modal generalization) is consistent with model-native features but does not constitute rigorous proof that the features are free of SAE training artifacts. The manuscript already states this limitation explicitly in both the abstract and the limitations discussion. Our central claim is narrower: that dictionary learning can be scaled to a production model while producing features that exhibit the reported properties under current evaluation methods. The use of scaling laws for hyperparameter selection and the consistency of steering outcomes with independent interpretations provide additional support beyond what was available for smaller models, even if a definitive faithfulness test remains unavailable.
  Revision: no
- Referee: [Abstract and limitations discussion] The explicit acknowledgment that the feature suite is incomplete and that faithfulness cannot be rigorously evaluated means the causal steering results and interpretability claims remain provisional; without a concrete test distinguishing model computations from SAE artifacts, the scaling demonstration does not yet fully resolve the open question posed in the abstract.
  Authors: We concur that the results are provisional precisely because no rigorous faithfulness test exists, as the paper states. The abstract is deliberately structured to pose the scaling question and then immediately qualify the claims with the acknowledged limitations. We maintain that demonstrating successful training and interpretable, steerable features at 34 million scale on a frontier model constitutes progress on the open question, even while the field lacks methods to fully separate model computations from SAE artifacts. The incompleteness of the feature suite is likewise already noted and does not undermine the scaling result for the features that were recovered.
  Revision: no
- A concrete test distinguishing model computations from SAE artifacts
Circularity Check
Empirical scaling demonstration with no circular derivation steps
full rationale
The paper reports direct training of sparse autoencoders on Claude 3 Sonnet activations, followed by empirical observations of feature interpretability via inspection, steering, and generalization tests. No equations, predictions, or first-principles claims are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The acknowledged limitation on faithfulness evaluation is an explicit open question rather than a hidden circularity. This is self-contained empirical work.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of features = up to 34 million
- sparsity hyperparameter
axioms (2)
- domain assumption: Sparse autoencoders recover interpretable, disentangled features from model residual streams
- domain assumption: The middle layer residual stream activations contain semantically meaningful information suitable for dictionary learning
Forward citations
Cited by 2 Pith papers
- Evidence for feature-specific error correction in LLMs
  Perturbation experiments across six LLMs show activation robustness follows L^p norm with p>2 for feature directions (contrastive, MELBO, SAE) but p≈2 for random/PCA controls, indicating feature-specific error correction.
- HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
  HydraHead hybridizes full and linear attention along the head dimension via interpretability-driven selection and scale-normalized fusion, matching layer-wise hybrids at higher linear ratios after 15B-token training.