Sparse Autoencoders Find Highly Interpretable Features in Language Models
Pith reviewed 2026-05-24 06:40 UTC · model grok-4.3
The pith
Sparse autoencoders recover sets of sparsely activating features from language model activations that are more interpretable and monosemantic than those found by prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sparse autoencoders trained to reconstruct language model activations learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches. With these features the authors can identify the ones causally responsible for counterfactual behavior on the indirect object identification task to a finer degree than previous decompositions. The work indicates that superposition can be resolved in language models by a scalable unsupervised method.
What carries the argument
Sparse autoencoders trained to reconstruct internal activations while encouraging sparse feature activations.
If this is right
- Superposition in language models can be resolved with a scalable unsupervised method.
- Mechanistic interpretability work can proceed from a dictionary of monosemantic features rather than polysemantic neurons.
- Causal responsibility for specific model behaviors can be attributed to individual features at higher resolution than before.
- Greater model transparency and steerability become feasible once features are isolated this way.
Where Pith is reading between the lines
- The same training procedure could be applied to activations from other model families or tasks to test whether the improvement in monosemanticity generalizes.
- If the learned features remain stable across different random seeds and training runs, they would provide a more reliable basis for editing model behavior.
- Combining the sparse autoencoder dictionary with causal tracing methods might allow systematic editing of high-level concepts without side effects on unrelated behaviors.
Load-bearing premise
The features found by the autoencoders correspond to genuine semantic concepts inside the language model rather than training artifacts, and the automated interpretability metrics accurately reflect human-understandable monosemanticity.
What would settle it
A direct test in which the identified features are ablated or scaled during a forward pass on the indirect object identification task and the expected change in output either fails to appear or appears for unrelated features.
Figures
read the original abstract
One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task \citep{wang2022interpretability} to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that sparse autoencoders applied to language model internal activations recover sparsely activating features that are more interpretable and monosemantic than directions from alternative approaches (with interpretability assessed via automated methods), and that these features enable finer-grained causal attribution of counterfactual behavior on the indirect object identification task than prior decompositions. It concludes that this provides a scalable unsupervised method for resolving superposition.
Significance. If the automated interpretability metrics are shown to track human judgments of monosemanticity and the recovered features are demonstrated to be model-internal concepts rather than training artifacts, the work would be significant as a practical, scalable tool for mechanistic interpretability. The IOI causal attribution result would strengthen the case for using such decompositions in downstream analysis.
major comments (2)
- [Abstract] Abstract: The central comparative claim that SAE features are 'more interpretable and monosemantic than directions identified by alternative approaches' rests entirely on unspecified automated methods. No details are given on the metrics themselves, their correlation with human judgments, or controls to rule out SAE-specific artifacts, making it impossible to evaluate whether the monosemanticity gains are genuine or artifactual. This is load-bearing for both the interpretability claim and the downstream IOI causal attribution result.
- [Abstract] Abstract: The claim that the method 'pinpoint[s] the features that are causally responsible for counterfactual behaviour on the indirect object identification task to a finer degree than previous decompositions' requires quantitative comparison (e.g., effect sizes, ablation results, or error bars) showing improvement over baselines such as the original IOI circuit analysis; the abstract supplies none, leaving the 'finer degree' assertion unsupported.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and detailed comments on our manuscript. We address each of the major comments below. We agree that the abstract would benefit from greater specificity on both points and will revise it accordingly while preserving its high-level nature.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central comparative claim that SAE features are 'more interpretable and monosemantic than directions identified by alternative approaches' rests entirely on unspecified automated methods. No details are given on the metrics themselves, their correlation with human judgments, or controls to rule out SAE-specific artifacts, making it impossible to evaluate whether the monosemanticity gains are genuine or artifactual. This is load-bearing for both the interpretability claim and the downstream IOI causal attribution result.
Authors: We agree the abstract does not detail the automated metrics. The manuscript (Sections 3 and 4) specifies the metrics as automated scoring of feature descriptions and activation contexts via a separate language model, with explicit baselines including random directions and PCA. Limited human correlation studies appear in Appendix B. Controls for artifacts are included via comparison to non-SAE decompositions. We will revise the abstract to briefly characterize the metrics and point to these sections. revision: yes
-
Referee: [Abstract] Abstract: The claim that the method 'pinpoint[s] the features that are causally responsible for counterfactual behaviour on the indirect object identification task to a finer degree than previous decompositions' requires quantitative comparison (e.g., effect sizes, ablation results, or error bars) showing improvement over baselines such as the original IOI circuit analysis; the abstract supplies none, leaving the 'finer degree' assertion unsupported.
Authors: The abstract summarizes results whose quantitative details (ablation effect sizes, comparisons to the original IOI circuit, and error bars across runs) are reported in Section 5. We will revise the abstract to include a short quantitative qualifier or reference to the effect-size improvements shown in the main text. revision: yes
Circularity Check
No significant circularity; empirical claims rest on external automated metrics and task-specific causal tests
full rationale
The paper's core argument is an empirical demonstration that sparse autoencoders trained on language-model activations recover features judged more interpretable than baselines by automated scoring methods, plus finer causal attribution on the IOI task. No equation or claim reduces a reported prediction or uniqueness result to a fitted parameter or self-citation by construction; the reconstruction objective is standard and the interpretability comparison is performed against external baselines using metrics defined outside the fitted values. Self-citations appear only for background (e.g., the IOI task) and are not load-bearing for the central monosemanticity or superposition-resolution claims. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Polysemanticity is caused by superposition in neural networks
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Our autoencoder is trained to minimise the loss function L(x) = ||x − ˆx||²₂ + α||c||₁ ... reconstruction with an ℓ1 penalty can recover the ground-truth features
-
IndisputableMonolith/Foundation/LogicAsFunctionalEquation.leanSatisfiesLawsOfLogic echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
sparse autoencoders to reconstruct the internal activations ... sets of sparsely activating features that are more interpretable and monosemantic
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
-
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
-
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
-
Markovian Circuit Tracing for Transformer State Dynamic
This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching i...
-
Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models
Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.
-
Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space
In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight...
-
Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE
KAN-SAE applies nonlinear per-feature B-spline activations in sparse autoencoders to discover 72% more alive climate features and interpretable patterns such as European heatwaves and Pacific typhoons in deep learning...
-
Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies
Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.
-
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.
-
Lost and Found in Translation: Variational Diagnostics for Neural Codebook Channels
Defines the neural codebook channel K_{e→d}(j|i) and proves a Bernoulli-KL bound on encoder-decoder mismatch in VAEs that cannot be recovered from marginal histograms or mutual information.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE factors sparse autoencoder decoder atoms to the native d_k x d_v cache write shape in recurrent models, provides a closed-form logit shift, and demonstrates high success in atom substitution and behavioral ed...
-
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
Interpreting Reinforcement Learning Agents with Susceptibilities
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
-
What Cohort INRs Encode and Where to Freeze Them
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
-
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.
-
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
-
SMolLM: Small Language Models Learn Small Molecular Grammar
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
-
Transformers with Selective Access to Early Representations
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
-
Transformers with Selective Access to Early Representations
SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.
-
How Language Models Process Negation
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
-
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in ...
-
Linear-Readout Floors and Threshold Recovery in Computation in Superposition
Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...
-
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
-
A Unifying Framework for Unsupervised Concept Extraction
A meta-theorem reduces establishing identifiability guarantees for concept extraction methods to the problem of characterizing the intersection of two sets.
-
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
-
Grokking of Diffusion Models: Case Study on Modular Addition
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
-
Diverse Dictionary Learning
Diverse dictionary learning identifies intersections, complements, and dependency structures of latent variables from data X = g(Z) up to indeterminacies, and full identifiability when structural diversity is sufficient.
-
Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.
-
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model pr...
-
What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features
Effective multilingual reasoning in large models relies on language-specific patterns in reasoning features rather than uniform English-like traces.
-
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.
-
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic as...
-
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM perf...
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models
Transcoders decompose MLP layers in Gemma 3-4B-IT to trace visual grounding more effectively than SAEs and predict hallucinations from circuit graph features at AUC 0.68.
-
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
-
Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models
SAE-FT uses a sparse autoencoder on pre-trained CLIP visual representations to regularize fine-tuning by penalizing changes to semantically meaningful features, aiming for robust performance on ImageNet and distributi...
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
-
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
-
Why Retrieval-Augmented Generation Fails: A Graph Perspective
Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.
-
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
TopK SAEs applied to EEG transformers extract clinical features, enable concept steering, and identify selectively steerable, entangled, and non-encoded regimes with a spectral decoder for physiological interpretation.
-
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Sparse autoencoders on EEG transformers extract clinical features, identify three steering regimes, expose age-pathology entanglements and wrecking-ball failures, and map interventions to frequency spectra.
-
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Sparse autoencoders on EEG transformers identify three regimes of clinical concept encoding and reveal entanglements such as age-pathology confounding via a new steering selectivity metric.
-
Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture
Introduces a trust-boundary architecture in Lean 4 with three certificate families and two operators that deliver sorry-free, axiom-audited assurances for LLM pipeline components.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 po...
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Domain Restriction via Multi SAE Layer Transitions
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
-
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
-
Tool Calling is Linearly Readable and Steerable in Language Models
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
-
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...
Reference graph
Works this paper leans on
-
[1]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
Pythia: A suite for analyzing large language models across training and scaling
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.\ 2397--2430. PMLR, 2023
work page 2023
-
[3]
Language models can explain neurons in language models
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. URL https://openaipublic. blob. core. windows. net/neuron-explainer/paper/index. html.(Date accessed: 14.05. 2023), 2023
work page 2023
-
[4]
Nick Cammarata, Gabriel Goh, Shan Carter, Chelsea Voss, Ludwig Schubert, and Chris Olah. Curve circuits. Distill, 2021. doi:10.23915/distill.00024.006. https://distill.pub/2020/circuits/curve-circuits
-
[5]
Towards automated circuit discovery for mechanistic interpretability
Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adri \`a Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997, 2023
-
[6]
Adaptively sparse transformers
Gon c alo M Correia, Vlad Niculae, and Andr \'e FT Martins. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.\ 2174--2184, 2019
work page 2019
-
[7]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021
work page 2021
-
[9]
Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav K...
work page 2022
-
[10]
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022 b
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Privileged bases in the transformer residual stream, 2023
Nelson Elhage, Robert Lasenby, and Chris Olah. Privileged bases in the transformer residual stream, 2023. URL https://transformer-circuits.pub/2023/privileged-basis/index.html. Accessed: 2023-08-07
work page 2023
-
[12]
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[13]
Cognitron: A self-organizing multilayered neural network
Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network. Biol. Cybern., 20 0 (3–4): 0 121–136, sep 1975. ISSN 0340-1200. doi:10.1007/BF00342633. URL https://doi.org/10.1007/BF00342633
-
[14]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[15]
Accelerating convolutional neural networks via activation map compression
Georgios Georgiadis. Accelerating convolutional neural networks via activation map compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 7085--7095, 2019
work page 2019
-
[16]
An Overview of Catastrophic AI Risks
Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001, 2023
work page internal anchor Pith review arXiv 2023
-
[17]
Elite backprop: Training sparse interpretable neurons
Theodoros Kasioumis, Joe Townsend, and Hiroya Inakoshi. Elite backprop: Training sparse interpretable neurons. In NeSy, pp.\ 82--93, 2021
work page 2021
-
[18]
Efficient sparse coding algorithms
Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. Advances in neural information processing systems, 19, 2006
work page 2006
-
[19]
Clara Meister, Stefan Lazov, Isabelle Augenstein, and Ryan Cotterell. Is sparse attention more interpretable? In Annual Meeting of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:235293798
work page 2021
-
[20]
The Alignment Problem from a Deep Learning Perspective , May 2025
Richard Ngo, Lawrence Chan, and S \"o ren Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022
-
[21]
Zoom in: An introduction to circuits
Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5 0 (3): 0 e00024--001, 2020
work page 2020
-
[22]
Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37 0 (23): 0 3311--3325, 1997
work page 1997
-
[23]
Sparse coding of sensory inputs
Bruno A Olshausen and David J Field. Sparse coding of sensory inputs. Current opinion in neurobiology, 14 0 (4): 0 481--487, 2004
work page 2004
-
[24]
Analysis of the optimization landscapes for overcomplete representation learning
Qing Qu, Yuexiang Zhai, Xiao Li, Yuqian Zhang, and Zhihui Zhu. Analysis of the optimization landscapes for overcomplete representation learning. arXiv preprint arXiv:1912.02427, 2019
-
[25]
Taking features out of superposition with sparse autoencoders, 2023
Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders, 2023. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition. Accessed: 2023-05-10
work page 2023
-
[26]
Investigating gender bias in language models using causal mediation analysis
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33: 0 12388--12401, 2020
work page 2020
-
[27]
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
John Wright and Yi Ma. High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. Cambridge University Press, 2022
work page 2022
-
[29]
Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021
-
[30]
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[31]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[32]
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.