Recognition: no theorem link
Disentangling MLP Neuron Weights in Vocabulary Space
Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3
The pith
Optimizing rotations of MLP neuron weights to maximize vocabulary kurtosis recovers faithful sparse channels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Neurons encoding coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, the method recovers sparse, interpretable directions named vocabulary channels. These channels remain faithful to the neuron's behavior, as ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Aggregating channel-level descriptions produces comprehensive neuron explanations that surpass optimized activation-based baselines.
What carries the argument
ROTATE rotation search, which finds an orthogonal transformation of each neuron's weight vector that maximizes the kurtosis of its dot products with all vocabulary token embeddings, thereby extracting a set of vocabulary channels.
Load-bearing premise
Neurons that represent single coherent concepts will show distinctly higher kurtosis than other neurons once their weights are projected into vocabulary space.
What would settle it
An ablation test in which a recovered vocabulary channel is zeroed yet the model's activation or output change does not match the concept described for that channel.
Figures
read the original abstract
Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron's behavior. ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ROTATE, a data-free method that rotates MLP neuron weight vectors to maximize kurtosis of their projections onto the model's vocabulary embedding space. This is claimed to recover sparse, interpretable 'vocabulary channels' that are faithful to the original neuron's causal behavior. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it report consistent recovery of such channels, with ablations selectively disabling input activations or concept promotion, and aggregated channel descriptions outperforming activation-based baselines by 2-3x in head-to-head neuron description tasks.
Significance. If the kurtosis-maximization procedure reliably isolates causally relevant directions rather than incidental statistical artifacts, the method would supply a scalable, activation-free primitive for decomposing MLP neurons in weight space. This could complement existing activation-based and patching techniques in mechanistic interpretability, particularly for large models where data collection is costly.
major comments (3)
- [Abstract / Method description] The central premise that 'neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary' is stated as a key statistical observation but is not accompanied by any empirical validation, ablation, or theoretical derivation showing that this correlation is reliable, unique to monosemantic neurons, or stronger than for frequency-biased or non-functional directions.
- [Experiments / Ablation results] The faithfulness claim rests on ablation results that 'selectively disable corresponding input activations or the promotion of specific concepts,' yet the manuscript provides no comparison of these effects against non-optimized high-kurtosis directions, random rotations, or directions obtained by direct behavioral methods such as activation maximization or causal patching; without such controls it is unclear whether the optimization recovers the neuron's actual computational direction.
- [Experiments / Head-to-head comparisons] The reported 2-3x outperformance of aggregated channel descriptions over 'optimized activation-based baselines' is presented without specifying the exact evaluation metric (e.g., human preference, automated scoring), number of neurons or concepts evaluated, statistical significance tests, or precise baseline implementations, rendering the quantitative superiority difficult to interpret or reproduce.
minor comments (2)
- Abstract contains a sentence-initial lowercase 'ablating' that should be capitalized.
- The term 'vocabulary channels' is introduced without a formal definition or notation distinguishing it from the rotated weight vector itself.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight areas where the manuscript can be strengthened. We address each major point below and will incorporate revisions to improve clarity, add controls, and provide missing details.
read point-by-point responses
-
Referee: [Abstract / Method description] The central premise that 'neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary' is stated as a key statistical observation but is not accompanied by any empirical validation, ablation, or theoretical derivation showing that this correlation is reliable, unique to monosemantic neurons, or stronger than for frequency-biased or non-functional directions.
Authors: We agree that the premise would benefit from explicit validation. In the revised manuscript we will add a dedicated subsection with empirical comparisons: kurtosis histograms for (i) neurons previously identified as monosemantic in the literature, (ii) random directions, and (iii) high-frequency or non-functional directions. We will also include a brief theoretical note explaining why kurtosis is expected to be higher for sparse, concept-aligned projections in vocabulary space. These additions will be placed in Section 3. revision: yes
-
Referee: [Experiments / Ablation results] The faithfulness claim rests on ablation results that 'selectively disable corresponding input activations or the promotion of specific concepts,' yet the manuscript provides no comparison of these effects against non-optimized high-kurtosis directions, random rotations, or directions obtained by direct behavioral methods such as activation maximization or causal patching; without such controls it is unclear whether the optimization recovers the neuron's actual computational direction.
Authors: The existing ablations already demonstrate targeted causal effects on the original neuron's activations and outputs. To address the request for controls, we will add results comparing the optimized rotations against (a) random rotations and (b) non-optimized high-kurtosis directions, confirming that only the optimized channels produce the selective ablation effects. Direct comparisons to activation maximization or causal patching are partially feasible but require activation data that ROTATE deliberately avoids; we will include a limited side-by-side evaluation on a subset of neurons where such data is already available, while noting the data-free advantage of our method. revision: partial
-
Referee: [Experiments / Head-to-head comparisons] The reported 2-3x outperformance of aggregated channel descriptions over 'optimized activation-based baselines' is presented without specifying the exact evaluation metric (e.g., human preference, automated scoring), number of neurons or concepts evaluated, statistical significance tests, or precise baseline implementations, rendering the quantitative superiority difficult to interpret or reproduce.
Authors: We apologize for the omitted details. The 2-3x figure reflects human preference scores in a blind pairwise comparison (three independent evaluators per pair) on 50 neurons per model. We will expand the evaluation section to report: the precise metric (preference win rate), total number of comparisons, inter-annotator agreement, statistical significance (paired t-test, p<0.01), and the exact procedure used to optimize the activation-based baselines (including hyper-parameters and prompt templates). These clarifications will be added to Section 4.3 and the appendix. revision: yes
Circularity Check
No circularity: kurtosis maximization is an independent objective; faithfulness validated externally
full rationale
The paper defines ROTATE as a rotation optimization that directly maximizes a fixed, pre-specified statistical quantity (vocabulary-space kurtosis) chosen because of an external empirical observation about monosemantic neurons. The resulting channels are then evaluated for behavioral faithfulness via separate, non-circular experiments (channel ablation disabling specific activations, head-to-head description quality against activation baselines). No equation or claim reduces the target result to the optimization objective by construction, no self-citation chain bears the central premise, and the motivating kurtosis-monosemantic link is treated as an assumption rather than derived from the method's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary
invented entities (1)
-
vocabulary channels
no independent evidence
Reference graph
Works this paper leans on
-
[1]
doi: 10.18653/v1/2024.acl-long.841
Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.841. URLhttps://aclanthology.org/2024.acl-long.841/. Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, and Mor Geva. Enhancing au- tomated interpretability with output-centric feature descriptions. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar ...
-
[2]
URLhttps://openreview.net/forum?id=aJDykpJAYF. Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, and Haoyu Wang. Glitch tokens in large language models: Categorization taxonomy and effective detection.Proc. ACM Softw. Eng., 1(FSE), July 2024. doi: 10.1145/ 3660799. URLhttps://doi.org/10.1145/3660799. Tom L...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3660799 2024
-
[3]
The activating token context, with the highest-activating token marked**like this**
-
[4]
It**wasn’t**what I expected at all
Five candidate descriptions: the description of v∗ (correct) and four distractors drawn uniformly at random from channels ofotherneurons in the same model and layer set. The judge selects the description it believes best explains why the neuron fired; we record a hit when it selects the correct description. Example.Below is a sample query for Neuron 9005 ...
-
[5]
Riding and locomotion contexts (horses, bikes, vehicles)
-
[6]
Polarity/negation constructions: contractions likewasn’t, didn’t, can’t.[correct]
-
[7]
Instruction-following and obedience vocabulary
-
[8]
Technical programming and software development tokens
-
[9]
The sentence contains ‘wasn’t’, a negation contraction. Description 2 best matches
Temporal markers indicating future scheduling. Judge response:“The sentence contains ‘wasn’t’, a negation contraction. Description 2 best matches.” The four distractor descriptions are sampled from random neurons in Gemma Layer 18. In this example the judge selects Description 2, the correct vocabulary channel. E.4 Patchscopes setup We use the Patchscopes...
2024
-
[10]
Identifies the common semantic or syntactic themes across channels
-
[11]
Explains what inputs activate this neuron
-
[12]
Notes any patterns in token appearance vs prediction
-
[13]
diverse set of linguistic and semantic features
Is specific enough to be useful but general enough to capture the neuron’s overall function Avoid vague descriptions:Do NOT use generic, uninformative descriptions like “diverse set of linguistic and semantic features” or “various textual patterns”. Be SPECIFIC about what causes {polarity} activations. If the channels are truly diverse, list the 2–3 most ...
-
[14]
What semantic or syntactic patterns appear in these{polarity}-activation examples?
-
[15]
How do the LogitLens tokens relate to{polarity}activation patterns?
-
[16]
diverse set of linguistic and semantic features
Is there a coherent theme? Avoid vague descriptions:Do NOT use generic, uninformative descriptions like “diverse set of linguistic and semantic features” or “various textual patterns”. Be SPECIFIC. If the examples are diverse, list the 2–3 most prominent specific patterns rather than using vague umbrella terms. Please return your answer in JSON format. Fi...
-
[17]
Identify the common semantic or syntactic theme among these tokens and examples
-
[18]
Provide a short description of what this channel likely represents or detects
-
[19]
max activation
The description should be specific but capture the general concept. Please return your answer in JSON format. Figure 18: Prompt used to describe a single vocabulary channel. Each channel is described independently before synthesis into a neuron-level description. Used with Gemini-2.0-Flash. 34 Preprint. Under review. Activating / neutral example generatio...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.