pith. machine review for the scientific record. sign in

arxiv: 2604.06005 · v1 · submitted 2026-04-07 · 💻 cs.CL

Recognition: no theorem link

Disentangling MLP Neuron Weights in Vocabulary Space

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords MLP neuronsvocabulary spacekurtosis maximizationrotation optimizationmechanistic interpretabilitymonosemantic conceptsdata-free methodneuron disentanglement
0
0 comments X

The pith

Optimizing rotations of MLP neuron weights to maximize vocabulary kurtosis recovers faithful sparse channels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a data-free technique that operates directly on MLP weight matrices to separate entangled neuron representations. It rests on the observation that neurons tied to single coherent concepts produce high kurtosis when their weights are multiplied by the model's vocabulary embeddings. A search for the best rotation matrix then isolates directions, called vocabulary channels, that capture distinct semantic aspects of the original neuron. Experiments confirm these channels match the neuron's actual behavior because zeroing one channel selectively removes specific input responses or concept promotions. When multiple channels are described and combined, the resulting neuron summary proves more accurate than summaries built from activation patterns alone.

Core claim

Neurons encoding coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, the method recovers sparse, interpretable directions named vocabulary channels. These channels remain faithful to the neuron's behavior, as ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Aggregating channel-level descriptions produces comprehensive neuron explanations that surpass optimized activation-based baselines.

What carries the argument

ROTATE rotation search, which finds an orthogonal transformation of each neuron's weight vector that maximizes the kurtosis of its dot products with all vocabulary token embeddings, thereby extracting a set of vocabulary channels.

Load-bearing premise

Neurons that represent single coherent concepts will show distinctly higher kurtosis than other neurons once their weights are projected into vocabulary space.

What would settle it

An ablation test in which a recovered vocabulary channel is zeroed yet the model's activation or output change does not match the concept described for that channel.

Figures

Figures reproduced from arXiv: 2604.06005 by Asaf Avrahamy, Mor Geva, Yoav Gur-Arieh.

Figure 1
Figure 1. Figure 1: We propose to disentangle MLP neuron weights ( [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Vocabulary kurtosis of concept vectors in Wout (Hong et al., 2025) vs. ran￾dom neurons from the same layers. High kurtosis as a monosemanticity signal Given the above observations, we hypothesize that the distribution over the vocabulary in￾duced by a weight vector could indicate how monosemantic it is. Specifically, we expect that monosemantic neurons will be correlated with higher kurtosis values of thei… view at source ↗
Figure 3
Figure 3. Figure 3: Input-side causal validity. Ablating the neuron’s top channel drives its activation toward 0; ablating other channels leaves it near 1. Causal validity via channel ablation To test whether channels are causally responsi￾ble for the neuron’s activation, we ablate the channel v from the neuron’s weight vector w by projecting out its contribution: wablated = w − (w · v) v. Then, we compare the neuron activati… view at source ↗
Figure 4
Figure 4. Figure 4: Head-to-head pairwise evaluation of ROTATE vocabulary channel descriptions against MaxAct+VocabProj and MaxAct++ baselines on Llama-3.1-8B-Instruct. Each bar shows the fraction of comparisons won by ROTATE, tied, or won by the baseline. Columns correspond to layers; rows to evaluation data sources and activation-rank ranges. This reflects a basic trade-off: activation-based methods condition on extreme res… view at source ↗
Figure 5
Figure 5. Figure 5: A distribution with high kurtosis and positive skewness, concentrated around [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Median vocabulary kurtosis values of neuron weights in [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-layer vocabulary kurtosis distributions of [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Weight reconstruction analysis on Gemma-2-2B-it. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Consistency of ROTATE across different initializations. The heatmap displays a pairwise cosine similarity between vocabulary chan￾nels discovered in two separate exe￾cution runs (Execution 1 vs. Execu￾tion 2) for the same target neuron. C.4 Avoiding glitch tokens A practical challenge we encountered is that the optimization frequently converges to “glitch tokens” (Li et al., 2024), which are under-trained … view at source ↗
Figure 10
Figure 10. Figure 10: Effect of token masking on iterative decomposition quality. We compare [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Ablation results evaluating weight reconstruction across optimization iterations [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Complete mechanistic decomposition of Neuron 9005 (Layer 18, Gemma-2-2B￾it) via vocabulary channels. Top: The neuron activates positively on technical text with negation/polarity concepts and negatively on temporal deferral. Middle: ROTATE’s input￾side wgate and win channels explain the sign of the activation, the wgate detects relevant context, while the win channel’s alignment or anti-alignment with the… view at source ↗
Figure 14
Figure 14. Figure 14: Per-channel faithfulness scores for representative gate channels of Neuron 9005 [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Polarity-split neuron description synthesis prompt ( [PITH_FULL_IMAGE:figures/full_fig_p033_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: MaxAct+VocabProj baseline description prompt ( [PITH_FULL_IMAGE:figures/full_fig_p033_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Head-to-head pairwise evaluation prompt ( [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Prompt used to describe a single vocabulary channel. Each channel is described [PITH_FULL_IMAGE:figures/full_fig_p034_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Prompt used to generate activating and neutral examples for the input-side [PITH_FULL_IMAGE:figures/full_fig_p035_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: 5-way channel matching prompt used for the completeness evaluation ( [PITH_FULL_IMAGE:figures/full_fig_p035_20.png] view at source ↗
read the original abstract

Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron's behavior. ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ROTATE, a data-free method that rotates MLP neuron weight vectors to maximize kurtosis of their projections onto the model's vocabulary embedding space. This is claimed to recover sparse, interpretable 'vocabulary channels' that are faithful to the original neuron's causal behavior. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it report consistent recovery of such channels, with ablations selectively disabling input activations or concept promotion, and aggregated channel descriptions outperforming activation-based baselines by 2-3x in head-to-head neuron description tasks.

Significance. If the kurtosis-maximization procedure reliably isolates causally relevant directions rather than incidental statistical artifacts, the method would supply a scalable, activation-free primitive for decomposing MLP neurons in weight space. This could complement existing activation-based and patching techniques in mechanistic interpretability, particularly for large models where data collection is costly.

major comments (3)
  1. [Abstract / Method description] The central premise that 'neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary' is stated as a key statistical observation but is not accompanied by any empirical validation, ablation, or theoretical derivation showing that this correlation is reliable, unique to monosemantic neurons, or stronger than for frequency-biased or non-functional directions.
  2. [Experiments / Ablation results] The faithfulness claim rests on ablation results that 'selectively disable corresponding input activations or the promotion of specific concepts,' yet the manuscript provides no comparison of these effects against non-optimized high-kurtosis directions, random rotations, or directions obtained by direct behavioral methods such as activation maximization or causal patching; without such controls it is unclear whether the optimization recovers the neuron's actual computational direction.
  3. [Experiments / Head-to-head comparisons] The reported 2-3x outperformance of aggregated channel descriptions over 'optimized activation-based baselines' is presented without specifying the exact evaluation metric (e.g., human preference, automated scoring), number of neurons or concepts evaluated, statistical significance tests, or precise baseline implementations, rendering the quantitative superiority difficult to interpret or reproduce.
minor comments (2)
  1. Abstract contains a sentence-initial lowercase 'ablating' that should be capitalized.
  2. The term 'vocabulary channels' is introduced without a formal definition or notation distinguishing it from the rotated weight vector itself.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas where the manuscript can be strengthened. We address each major point below and will incorporate revisions to improve clarity, add controls, and provide missing details.

read point-by-point responses
  1. Referee: [Abstract / Method description] The central premise that 'neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary' is stated as a key statistical observation but is not accompanied by any empirical validation, ablation, or theoretical derivation showing that this correlation is reliable, unique to monosemantic neurons, or stronger than for frequency-biased or non-functional directions.

    Authors: We agree that the premise would benefit from explicit validation. In the revised manuscript we will add a dedicated subsection with empirical comparisons: kurtosis histograms for (i) neurons previously identified as monosemantic in the literature, (ii) random directions, and (iii) high-frequency or non-functional directions. We will also include a brief theoretical note explaining why kurtosis is expected to be higher for sparse, concept-aligned projections in vocabulary space. These additions will be placed in Section 3. revision: yes

  2. Referee: [Experiments / Ablation results] The faithfulness claim rests on ablation results that 'selectively disable corresponding input activations or the promotion of specific concepts,' yet the manuscript provides no comparison of these effects against non-optimized high-kurtosis directions, random rotations, or directions obtained by direct behavioral methods such as activation maximization or causal patching; without such controls it is unclear whether the optimization recovers the neuron's actual computational direction.

    Authors: The existing ablations already demonstrate targeted causal effects on the original neuron's activations and outputs. To address the request for controls, we will add results comparing the optimized rotations against (a) random rotations and (b) non-optimized high-kurtosis directions, confirming that only the optimized channels produce the selective ablation effects. Direct comparisons to activation maximization or causal patching are partially feasible but require activation data that ROTATE deliberately avoids; we will include a limited side-by-side evaluation on a subset of neurons where such data is already available, while noting the data-free advantage of our method. revision: partial

  3. Referee: [Experiments / Head-to-head comparisons] The reported 2-3x outperformance of aggregated channel descriptions over 'optimized activation-based baselines' is presented without specifying the exact evaluation metric (e.g., human preference, automated scoring), number of neurons or concepts evaluated, statistical significance tests, or precise baseline implementations, rendering the quantitative superiority difficult to interpret or reproduce.

    Authors: We apologize for the omitted details. The 2-3x figure reflects human preference scores in a blind pairwise comparison (three independent evaluators per pair) on 50 neurons per model. We will expand the evaluation section to report: the precise metric (preference win rate), total number of comparisons, inter-annotator agreement, statistical significance (paired t-test, p<0.01), and the exact procedure used to optimize the activation-based baselines (including hyper-parameters and prompt templates). These clarifications will be added to Section 4.3 and the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: kurtosis maximization is an independent objective; faithfulness validated externally

full rationale

The paper defines ROTATE as a rotation optimization that directly maximizes a fixed, pre-specified statistical quantity (vocabulary-space kurtosis) chosen because of an external empirical observation about monosemantic neurons. The resulting channels are then evaluated for behavioral faithfulness via separate, non-circular experiments (channel ablation disabling specific activations, head-to-head description quality against activation baselines). No equation or claim reduces the target result to the optimization objective by construction, no self-citation chain bears the central premise, and the motivating kurtosis-monosemantic link is treated as an assumption rather than derived from the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The method rests on one domain assumption about kurtosis and introduces one new entity (vocabulary channels) whose independent evidence is limited to the reported experiments.

axioms (1)
  • domain assumption Neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary
    This statistical observation is stated as the foundation for the rotation optimization.
invented entities (1)
  • vocabulary channels no independent evidence
    purpose: Sparse, interpretable directions recovered by kurtosis-maximizing rotations of neuron weights
    New directions introduced as the output of ROTATE; no external falsifiable prediction is given in the abstract.

pith-pipeline@v0.9.0 · 5492 in / 1279 out tokens · 44665 ms · 2026-05-10T19:16:47.991656+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    doi: 10.18653/v1/2024.acl-long.841

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.841. URLhttps://aclanthology.org/2024.acl-long.841/. Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, and Mor Geva. Enhancing au- tomated interpretability with output-centric feature descriptions. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar ...

  2. [2]

    GPT-4 Technical Report

    URLhttps://openreview.net/forum?id=aJDykpJAYF. Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, and Haoyu Wang. Glitch tokens in large language models: Categorization taxonomy and effective detection.Proc. ACM Softw. Eng., 1(FSE), July 2024. doi: 10.1145/ 3660799. URLhttps://doi.org/10.1145/3660799. Tom L...

  3. [3]

    The activating token context, with the highest-activating token marked**like this**

  4. [4]

    It**wasn’t**what I expected at all

    Five candidate descriptions: the description of v∗ (correct) and four distractors drawn uniformly at random from channels ofotherneurons in the same model and layer set. The judge selects the description it believes best explains why the neuron fired; we record a hit when it selects the correct description. Example.Below is a sample query for Neuron 9005 ...

  5. [5]

    Riding and locomotion contexts (horses, bikes, vehicles)

  6. [6]

    Polarity/negation constructions: contractions likewasn’t, didn’t, can’t.[correct]

  7. [7]

    Instruction-following and obedience vocabulary

  8. [8]

    Technical programming and software development tokens

  9. [9]

    The sentence contains ‘wasn’t’, a negation contraction. Description 2 best matches

    Temporal markers indicating future scheduling. Judge response:“The sentence contains ‘wasn’t’, a negation contraction. Description 2 best matches.” The four distractor descriptions are sampled from random neurons in Gemma Layer 18. In this example the judge selects Description 2, the correct vocabulary channel. E.4 Patchscopes setup We use the Patchscopes...

  10. [10]

    Identifies the common semantic or syntactic themes across channels

  11. [11]

    Explains what inputs activate this neuron

  12. [12]

    Notes any patterns in token appearance vs prediction

  13. [13]

    diverse set of linguistic and semantic features

    Is specific enough to be useful but general enough to capture the neuron’s overall function Avoid vague descriptions:Do NOT use generic, uninformative descriptions like “diverse set of linguistic and semantic features” or “various textual patterns”. Be SPECIFIC about what causes {polarity} activations. If the channels are truly diverse, list the 2–3 most ...

  14. [14]

    What semantic or syntactic patterns appear in these{polarity}-activation examples?

  15. [15]

    How do the LogitLens tokens relate to{polarity}activation patterns?

  16. [16]

    diverse set of linguistic and semantic features

    Is there a coherent theme? Avoid vague descriptions:Do NOT use generic, uninformative descriptions like “diverse set of linguistic and semantic features” or “various textual patterns”. Be SPECIFIC. If the examples are diverse, list the 2–3 most prominent specific patterns rather than using vague umbrella terms. Please return your answer in JSON format. Fi...

  17. [17]

    Identify the common semantic or syntactic theme among these tokens and examples

  18. [18]

    Provide a short description of what this channel likely represents or detects

  19. [19]

    max activation

    The description should be specific but capture the general concept. Please return your answer in JSON format. Figure 18: Prompt used to describe a single vocabulary channel. Each channel is described independently before synthesis into a neuron-level description. Used with Gemini-2.0-Flash. 34 Preprint. Under review. Activating / neutral example generatio...