Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora; Jacob Steinhardt; Sarah Schwettmann; Zhengxuan Wu

arxiv: 2601.22594 · v2 · pith:RLL4NQFVnew · submitted 2026-01-30 · 💻 cs.CL · cs.AI

classification 💻 cs.CL cs.AI

keywords: model, neurons, basis, circuit, language, neuron, sparse, computation

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques that decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from Lindsey et al. (2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state), and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.
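
To make the idea of gradient-based attribution on the neuron basis concrete, here is a minimal sketch on a toy two-layer ReLU MLP. It is not the authors' pipeline: the network, the weight names (`W_in`, `W_out`), and the helper functions are all hypothetical, and the score used (activation times gradient of one output logit) is a common first-order attribution that we assume is similar in spirit. Because the toy readout is linear, the scores here are exact; in a real language model they would only approximate each neuron's effect.

```python
import numpy as np

def mlp_forward(x, W_in, W_out):
    """Toy two-layer ReLU MLP; returns (hidden neuron activations, output logits)."""
    h = np.maximum(0.0, x @ W_in)
    return h, h @ W_out

def neuron_attributions(x, W_in, W_out, target):
    """Activation-times-gradient score of each hidden neuron for one logit."""
    h, _ = mlp_forward(x, W_in, W_out)
    grad = W_out[:, target]  # d logit[target] / d h for a linear readout
    return h * grad

def top_k_circuit(scores, k):
    """Indices of the k neurons with the largest absolute attribution."""
    return np.argsort(-np.abs(scores))[:k]

def ablate(x, W_in, W_out, neurons):
    """Recompute the logits with the chosen hidden neurons set to zero."""
    h, _ = mlp_forward(x, W_in, W_out)
    h = h.copy()
    h[neurons] = 0.0
    return h @ W_out

# Hypothetical random weights and input, for illustration only.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 32))
W_out = rng.normal(size=(32, 4))
x = rng.normal(size=8)

scores = neuron_attributions(x, W_in, W_out, target=0)
circuit = top_k_circuit(scores, k=5)
_, logits = mlp_forward(x, W_in, W_out)
ablated = ablate(x, W_in, W_out, circuit)
```

In this toy setting, zeroing the selected neurons changes the target logit by exactly the sum of their scores, which is the kind of causal check that ablation or steering experiments perform on a candidate circuit.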

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sparsely gated tiny linear experts
cs.LG 2026-06 unverdicted novelty 6.0

Sgatlin replaces transformer FF layers with sparse single linear neurons, improving perplexity across compute budgets and enabling direct interpretation of semantically clustered circuits for factual recall.

Fast & Faithful Function Vectors
cs.CL 2026-06 unverdicted novelty 4.0

LRP-based attention head selection and distributed application improve the efficiency and accuracy of function vectors for steering LLMs compared to prior choices.