arxiv: 2604.06005 · v1 · submitted 2026-04-07 · 💻 cs.CL

Recognition: no theorem link

Disentangling MLP Neuron Weights in Vocabulary Space

Asaf Avrahamy , Yoav Gur-Arieh , Mor Geva

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords MLP neuronsvocabulary spacekurtosis maximizationrotation optimizationmechanistic interpretabilitymonosemantic conceptsdata-free methodneuron disentanglement

0 comments

The pith

Optimizing rotations of MLP neuron weights to maximize vocabulary kurtosis recovers faithful sparse channels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a data-free technique that operates directly on MLP weight matrices to separate entangled neuron representations. It rests on the observation that neurons tied to single coherent concepts produce high kurtosis when their weights are multiplied by the model's vocabulary embeddings. A search for the best rotation matrix then isolates directions, called vocabulary channels, that capture distinct semantic aspects of the original neuron. Experiments confirm these channels match the neuron's actual behavior because zeroing one channel selectively removes specific input responses or concept promotions. When multiple channels are described and combined, the resulting neuron summary proves more accurate than summaries built from activation patterns alone.

Core claim

Neurons encoding coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, the method recovers sparse, interpretable directions named vocabulary channels. These channels remain faithful to the neuron's behavior, as ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Aggregating channel-level descriptions produces comprehensive neuron explanations that surpass optimized activation-based baselines.

What carries the argument

ROTATE rotation search, which finds an orthogonal transformation of each neuron's weight vector that maximizes the kurtosis of its dot products with all vocabulary token embeddings, thereby extracting a set of vocabulary channels.

Load-bearing premise

Neurons that represent single coherent concepts will show distinctly higher kurtosis than other neurons once their weights are projected into vocabulary space.

What would settle it

An ablation test in which a recovered vocabulary channel is zeroed yet the model's activation or output change does not match the concept described for that channel.

Figures

Figures reproduced from arXiv: 2604.06005 by Asaf Avrahamy, Mor Geva, Yoav Gur-Arieh.

**Figure 2.** Figure 2: Vocabulary kurtosis of concept vectors in Wout (Hong et al., 2025) vs. random neurons from the same layers. High kurtosis as a monosemanticity signal Given the above observations, we hypothesize that the distribution over the vocabulary induced by a weight vector could indicate how monosemantic it is. Specifically, we expect that monosemantic neurons will be correlated with higher kurtosis values of thei… view at source ↗

**Figure 3.** Figure 3: Input-side causal validity. Ablating the neuron’s top channel drives its activation toward 0; ablating other channels leaves it near 1. Causal validity via channel ablation To test whether channels are causally responsible for the neuron’s activation, we ablate the channel v from the neuron’s weight vector w by projecting out its contribution: wablated = w − (w · v) v. Then, we compare the neuron activati… view at source ↗

**Figure 4.** Figure 4: Head-to-head pairwise evaluation of ROTATE vocabulary channel descriptions against MaxAct+VocabProj and MaxAct++ baselines on Llama-3.1-8B-Instruct. Each bar shows the fraction of comparisons won by ROTATE, tied, or won by the baseline. Columns correspond to layers; rows to evaluation data sources and activation-rank ranges. This reflects a basic trade-off: activation-based methods condition on extreme res… view at source ↗

**Figure 5.** Figure 5: A distribution with high kurtosis and positive skewness, concentrated around [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Median vocabulary kurtosis values of neuron weights in [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Per-layer vocabulary kurtosis distributions of [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Weight reconstruction analysis on Gemma-2-2B-it. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Consistency of ROTATE across different initializations. The heatmap displays a pairwise cosine similarity between vocabulary channels discovered in two separate execution runs (Execution 1 vs. Execution 2) for the same target neuron. C.4 Avoiding glitch tokens A practical challenge we encountered is that the optimization frequently converges to “glitch tokens” (Li et al., 2024), which are under-trained … view at source ↗

**Figure 10.** Figure 10: Effect of token masking on iterative decomposition quality. We compare [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

**Figure 11.** Figure 11: Ablation results evaluating weight reconstruction across optimization iterations [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Complete mechanistic decomposition of Neuron 9005 (Layer 18, Gemma-2-2Bit) via vocabulary channels. Top: The neuron activates positively on technical text with negation/polarity concepts and negatively on temporal deferral. Middle: ROTATE’s inputside wgate and win channels explain the sign of the activation, the wgate detects relevant context, while the win channel’s alignment or anti-alignment with the… view at source ↗

**Figure 14.** Figure 14: Per-channel faithfulness scores for representative gate channels of Neuron 9005 [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗

**Figure 15.** Figure 15: Polarity-split neuron description synthesis prompt ( [PITH_FULL_IMAGE:figures/full_fig_p033_15.png] view at source ↗

**Figure 16.** Figure 16: MaxAct+VocabProj baseline description prompt ( [PITH_FULL_IMAGE:figures/full_fig_p033_16.png] view at source ↗

**Figure 17.** Figure 17: Head-to-head pairwise evaluation prompt ( [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗

**Figure 18.** Figure 18: Prompt used to describe a single vocabulary channel. Each channel is described [PITH_FULL_IMAGE:figures/full_fig_p034_18.png] view at source ↗

**Figure 19.** Figure 19: Prompt used to generate activating and neutral examples for the input-side [PITH_FULL_IMAGE:figures/full_fig_p035_19.png] view at source ↗

**Figure 20.** Figure 20: 5-way channel matching prompt used for the completeness evaluation ( [PITH_FULL_IMAGE:figures/full_fig_p035_20.png] view at source ↗

read the original abstract

Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron's behavior. ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ROTATE rotates MLP weights to maximize vocabulary kurtosis and extract sparse channels, offering a data-free alternative that beats activation baselines on two models but rests on an unproven link between kurtosis and behavioral faithfulness.

read the letter

This paper's core move is ROTATE: rotate the weights of an MLP neuron to maximize the kurtosis of its projection onto the model's vocabulary, then treat the resulting sparse directions as vocabulary channels. The claim is that these channels are faithful to the neuron's actual behavior and that stacking their descriptions gives better neuron-level explanations than activation-based methods, with a reported 2-3x edge on Llama-3.1-8B-Instruct and Gemma-2-2B-it. Ablations that turn off individual channels are said to disable the expected input activations or concept promotion. The data-free nature is the practical advantage; no forward passes or activation collection required. That alone makes it worth examining for anyone trying to scale interpretability beyond small models. The optimization itself is simple and targets a fixed statistical property rather than fitting parameters to observed outputs, which keeps it from being circular in an obvious way. The authors also test on reasonably current models and show consistent recovery across them. The main weakness is the central assumption that monosemantic, functionally relevant directions will reliably produce high kurtosis under vocabulary projection, and that gradient optimization will land on the causally active one rather than some other high-kurtosis artifact. If frequency biases or non-functional patterns can also drive high kurtosis, the recovered channels could look clean without matching what the neuron actually computes. The abstract does not give the precise metrics, significance tests, or controls used for the 2-3x claim, so it is difficult to judge how much of the improvement is robust versus sensitive to evaluation choices. Direct comparisons to activation maximization or patching results would tighten this. Readers working on neuron-level mechanistic interpretability or cheap model editing primitives will get the most out of it. The idea is distinct enough from dictionary learning or activation patching that it merits referee time even if the validation needs strengthening.

Referee Report

3 major / 2 minor

Summary. The paper introduces ROTATE, a data-free method that rotates MLP neuron weight vectors to maximize kurtosis of their projections onto the model's vocabulary embedding space. This is claimed to recover sparse, interpretable 'vocabulary channels' that are faithful to the original neuron's causal behavior. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it report consistent recovery of such channels, with ablations selectively disabling input activations or concept promotion, and aggregated channel descriptions outperforming activation-based baselines by 2-3x in head-to-head neuron description tasks.

Significance. If the kurtosis-maximization procedure reliably isolates causally relevant directions rather than incidental statistical artifacts, the method would supply a scalable, activation-free primitive for decomposing MLP neurons in weight space. This could complement existing activation-based and patching techniques in mechanistic interpretability, particularly for large models where data collection is costly.

major comments (3)

[Abstract / Method description] The central premise that 'neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary' is stated as a key statistical observation but is not accompanied by any empirical validation, ablation, or theoretical derivation showing that this correlation is reliable, unique to monosemantic neurons, or stronger than for frequency-biased or non-functional directions.
[Experiments / Ablation results] The faithfulness claim rests on ablation results that 'selectively disable corresponding input activations or the promotion of specific concepts,' yet the manuscript provides no comparison of these effects against non-optimized high-kurtosis directions, random rotations, or directions obtained by direct behavioral methods such as activation maximization or causal patching; without such controls it is unclear whether the optimization recovers the neuron's actual computational direction.
[Experiments / Head-to-head comparisons] The reported 2-3x outperformance of aggregated channel descriptions over 'optimized activation-based baselines' is presented without specifying the exact evaluation metric (e.g., human preference, automated scoring), number of neurons or concepts evaluated, statistical significance tests, or precise baseline implementations, rendering the quantitative superiority difficult to interpret or reproduce.

minor comments (2)

Abstract contains a sentence-initial lowercase 'ablating' that should be capitalized.
The term 'vocabulary channels' is introduced without a formal definition or notation distinguishing it from the rotated weight vector itself.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas where the manuscript can be strengthened. We address each major point below and will incorporate revisions to improve clarity, add controls, and provide missing details.

read point-by-point responses

Referee: [Abstract / Method description] The central premise that 'neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary' is stated as a key statistical observation but is not accompanied by any empirical validation, ablation, or theoretical derivation showing that this correlation is reliable, unique to monosemantic neurons, or stronger than for frequency-biased or non-functional directions.

Authors: We agree that the premise would benefit from explicit validation. In the revised manuscript we will add a dedicated subsection with empirical comparisons: kurtosis histograms for (i) neurons previously identified as monosemantic in the literature, (ii) random directions, and (iii) high-frequency or non-functional directions. We will also include a brief theoretical note explaining why kurtosis is expected to be higher for sparse, concept-aligned projections in vocabulary space. These additions will be placed in Section 3. revision: yes
Referee: [Experiments / Ablation results] The faithfulness claim rests on ablation results that 'selectively disable corresponding input activations or the promotion of specific concepts,' yet the manuscript provides no comparison of these effects against non-optimized high-kurtosis directions, random rotations, or directions obtained by direct behavioral methods such as activation maximization or causal patching; without such controls it is unclear whether the optimization recovers the neuron's actual computational direction.

Authors: The existing ablations already demonstrate targeted causal effects on the original neuron's activations and outputs. To address the request for controls, we will add results comparing the optimized rotations against (a) random rotations and (b) non-optimized high-kurtosis directions, confirming that only the optimized channels produce the selective ablation effects. Direct comparisons to activation maximization or causal patching are partially feasible but require activation data that ROTATE deliberately avoids; we will include a limited side-by-side evaluation on a subset of neurons where such data is already available, while noting the data-free advantage of our method. revision: partial
Referee: [Experiments / Head-to-head comparisons] The reported 2-3x outperformance of aggregated channel descriptions over 'optimized activation-based baselines' is presented without specifying the exact evaluation metric (e.g., human preference, automated scoring), number of neurons or concepts evaluated, statistical significance tests, or precise baseline implementations, rendering the quantitative superiority difficult to interpret or reproduce.

Authors: We apologize for the omitted details. The 2-3x figure reflects human preference scores in a blind pairwise comparison (three independent evaluators per pair) on 50 neurons per model. We will expand the evaluation section to report: the precise metric (preference win rate), total number of comparisons, inter-annotator agreement, statistical significance (paired t-test, p<0.01), and the exact procedure used to optimize the activation-based baselines (including hyper-parameters and prompt templates). These clarifications will be added to Section 4.3 and the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: kurtosis maximization is an independent objective; faithfulness validated externally

full rationale

The paper defines ROTATE as a rotation optimization that directly maximizes a fixed, pre-specified statistical quantity (vocabulary-space kurtosis) chosen because of an external empirical observation about monosemantic neurons. The resulting channels are then evaluated for behavioral faithfulness via separate, non-circular experiments (channel ablation disabling specific activations, head-to-head description quality against activation baselines). No equation or claim reduces the target result to the optimization objective by construction, no self-citation chain bears the central premise, and the motivating kurtosis-monosemantic link is treated as an assumption rather than derived from the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The method rests on one domain assumption about kurtosis and introduces one new entity (vocabulary channels) whose independent evidence is limited to the reported experiments.

axioms (1)

domain assumption Neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary
This statistical observation is stated as the foundation for the rotation optimization.

invented entities (1)

vocabulary channels no independent evidence
purpose: Sparse, interpretable directions recovered by kurtosis-maximizing rotations of neuron weights
New directions introduced as the output of ROTATE; no external falsifiable prediction is given in the abstract.

pith-pipeline@v0.9.0 · 5492 in / 1279 out tokens · 44665 ms · 2026-05-10T19:16:47.991656+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 2 canonical work pages · 1 internal anchor

[1]

doi: 10.18653/v1/2024.acl-long.841

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.841. URLhttps://aclanthology.org/2024.acl-long.841/. Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, and Mor Geva. Enhancing au- tomated interpretability with output-centric feature descriptions. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar ...

work page doi:10.18653/v1/2024.acl-long.841 2024
[2]

GPT-4 Technical Report

URLhttps://openreview.net/forum?id=aJDykpJAYF. Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, and Haoyu Wang. Glitch tokens in large language models: Categorization taxonomy and effective detection.Proc. ACM Softw. Eng., 1(FSE), July 2024. doi: 10.1145/ 3660799. URLhttps://doi.org/10.1145/3660799. Tom L...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3660799 2024
[3]

The activating token context, with the highest-activating token marked**like this**
[4]

It**wasn’t**what I expected at all

Five candidate descriptions: the description of v∗ (correct) and four distractors drawn uniformly at random from channels ofotherneurons in the same model and layer set. The judge selects the description it believes best explains why the neuron fired; we record a hit when it selects the correct description. Example.Below is a sample query for Neuron 9005 ...
[5]

Riding and locomotion contexts (horses, bikes, vehicles)
[6]

Polarity/negation constructions: contractions likewasn’t, didn’t, can’t.[correct]
[7]

Instruction-following and obedience vocabulary
[8]

Technical programming and software development tokens
[9]

The sentence contains ‘wasn’t’, a negation contraction. Description 2 best matches

Temporal markers indicating future scheduling. Judge response:“The sentence contains ‘wasn’t’, a negation contraction. Description 2 best matches.” The four distractor descriptions are sampled from random neurons in Gemma Layer 18. In this example the judge selects Description 2, the correct vocabulary channel. E.4 Patchscopes setup We use the Patchscopes...

2024
[10]

Identifies the common semantic or syntactic themes across channels
[11]

Explains what inputs activate this neuron
[12]

Notes any patterns in token appearance vs prediction
[13]

diverse set of linguistic and semantic features

Is specific enough to be useful but general enough to capture the neuron’s overall function Avoid vague descriptions:Do NOT use generic, uninformative descriptions like “diverse set of linguistic and semantic features” or “various textual patterns”. Be SPECIFIC about what causes {polarity} activations. If the channels are truly diverse, list the 2–3 most ...
[14]

What semantic or syntactic patterns appear in these{polarity}-activation examples?
[15]

How do the LogitLens tokens relate to{polarity}activation patterns?
[16]

diverse set of linguistic and semantic features

Is there a coherent theme? Avoid vague descriptions:Do NOT use generic, uninformative descriptions like “diverse set of linguistic and semantic features” or “various textual patterns”. Be SPECIFIC. If the examples are diverse, list the 2–3 most prominent specific patterns rather than using vague umbrella terms. Please return your answer in JSON format. Fi...
[17]

Identify the common semantic or syntactic theme among these tokens and examples
[18]

Provide a short description of what this channel likely represents or detects
[19]

max activation

The description should be specific but capture the general concept. Please return your answer in JSON format. Figure 18: Prompt used to describe a single vocabulary channel. Each channel is described independently before synthesis into a neuron-level description. Used with Gemini-2.0-Flash. 34 Preprint. Under review. Activating / neutral example generatio...