
arxiv: 2506.10920 · v2 · submitted 2025-06-12 · 💻 cs.CL · cs.LG

Constructing Interpretable Features from Compositional Neuron Groups

Pith reviewed 2026-05-19 09:09 UTC · model grok-4.3

keywords: mechanistic interpretability, MLP activations, matrix factorization, neuron groups, causal steering, concept representations, feature decomposition

The pith

Decomposing MLP activations with SNMF produces interpretable features that steer language models more effectively than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies semi-nonnegative matrix factorization directly to the activation patterns inside the feed-forward layers of language models. This yields features that each consist of a small group of neurons firing together for particular inputs. The resulting features line up with concepts people can recognize and allow stronger control over model outputs through targeted changes in activation space. Experiments across Llama 3.1, Gemma 2, and GPT-2 show these features beat both sparse autoencoders and a supervised baseline at causal steering tasks. The work also finds that the same small neuron groups appear in multiple related features, revealing reuse and structure in how the model represents concepts.

Core claim

Applying semi-nonnegative matrix factorization to MLP activations learns features that are sparse linear combinations of co-activated neurons. Each such feature can be directly linked to the input tokens that cause the relevant neurons to fire. This yields directions that both align with human-interpretable concepts and exert stronger causal influence on model outputs than those from sparse autoencoders or difference-in-means baselines. The analysis further shows that the same neuron combinations appear in multiple related features, pointing to a compositional hierarchy within the activation space.

What carries the argument

Semi-nonnegative matrix factorization of MLP activation matrices. Each feature is a sparse linear combination of co-activated neurons, and the factorization's non-negative coefficients preserve a direct map from each feature to the inputs that trigger it.
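To make the factorization concrete, here is a minimal NumPy sketch of plain semi-NMF using the standard multiplicative updates from the matrix factorization literature. It is an illustration under stated assumptions, not the paper's implementation: the names `A`, `Z`, `Y`, `k`, and `semi_nmf` are hypothetical, the activations are synthetic, and the sparsity step that the paper relies on is omitted.

```python
import numpy as np

def _pos(m):
    # Element-wise positive part of a matrix.
    return (np.abs(m) + m) / 2.0

def _neg(m):
    # Element-wise negative part of a matrix, returned as non-negative values.
    return (np.abs(m) - m) / 2.0

def semi_nmf(A, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate A (neurons x tokens) as Z @ Y with Y >= 0 and Z unconstrained.

    Z (neurons x k): each column is a feature, a mixed-sign combination of neurons.
    Y (k x tokens):  non-negative coefficients linking each feature to inputs.
    """
    rng = np.random.default_rng(seed)
    n_tokens = A.shape[1]
    Y = rng.random((k, n_tokens)) + 0.1          # strictly positive start
    for _ in range(n_iter):
        # Z step: unconstrained least squares given Y.
        Z = A @ Y.T @ np.linalg.pinv(Y @ Y.T)
        # Y step: multiplicative update that keeps Y non-negative.
        ZtA = Z.T @ A                            # k x tokens
        ZtZ = Z.T @ Z                            # k x k
        numer = _pos(ZtA) + _neg(ZtZ) @ Y
        denom = _neg(ZtA) + _pos(ZtZ) @ Y
        Y *= np.sqrt((numer + eps) / (denom + eps))
    # Final Z step so that Z is optimal for the returned Y.
    Z = A @ Y.T @ np.linalg.pinv(Y @ Y.T)
    return Z, Y

# Synthetic stand-in for an MLP activation matrix (assumption, not real data).
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 3)) @ rng.random((3, 64))
Z, Y = semi_nmf(A, k=3)
print("relative error:", np.linalg.norm(A - Z @ Y) / np.linalg.norm(A))
print("smallest coefficient:", Y.min())
```

Reading off the largest entries in a row of `Y` gives the inputs that most activate that feature, which is the sense in which the map to activating inputs comes for free in this kind of decomposition.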

If this is right

  • Steering interventions on these features change model behavior more reliably than interventions on SAE directions or difference-in-means vectors.
  • The same small sets of neurons are reused across multiple semantically related features, indicating compositional reuse in concept formation.
  • The activation space of MLP layers contains a detectable hierarchical structure built from shared neuron groups.
  • Each feature comes with an explicit list of input tokens that activate it, allowing direct verification of its meaning without additional post-hoc labeling.
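The steering comparison in the first bullet can be pictured with a small sketch. The code below shows a generic difference-in-means direction and a generic additive intervention on activation rows. It assumes unit-normalized directions and a scalar strength `alpha`; the function names are hypothetical, and this page does not state the intervention site, normalization, or scale the paper actually uses.

```python
import numpy as np

def diff_in_means(concept_acts, other_acts):
    """Mean activation on concept inputs minus mean on other inputs.

    Both arguments are arrays of shape (tokens, neurons).
    """
    return concept_acts.mean(axis=0) - other_acts.mean(axis=0)

def steer(acts, direction, alpha):
    """Add alpha times the unit-normalized direction to every activation row."""
    unit = direction / (np.linalg.norm(direction) + 1e-12)
    return acts + alpha * unit

# Toy activations (assumption): the concept shows up on neuron 0 only.
concept = np.array([[1.0, 0.0], [3.0, 0.0]])
other = np.zeros((2, 2))
direction = diff_in_means(concept, other)
steered = steer(np.zeros((1, 2)), direction, alpha=5.0)
print(direction, steered)
```

An SNMF feature or an SAE decoder vector could be passed to `steer` in place of the difference-in-means direction, which is what makes the three methods comparable under a single intervention.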

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same factorization could be tried on attention activations to test whether similar compositional groups exist outside the feed-forward path.
  • If the reuse of neuron groups holds across models, it would suggest that language models converge on a shared vocabulary of low-level computational building blocks.
  • Extending the method to track how these groups combine at different layers might reveal how simple features assemble into more abstract concepts during forward passes.

Load-bearing premise

That the co-activation statistics captured by the factorization correspond to the actual computational groupings the model uses to represent and process concepts.

What would settle it

A controlled test in which adding or subtracting the SNMF-derived features produces no reliable change in the model's output probabilities for the corresponding concepts, or where the features fail to match consistent human labels across held-out examples.

original abstract

A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a method using semi-nonnegative matrix factorization (SNMF) on MLP activations to identify interpretable features as sparse linear combinations of co-activated neurons. These features are designed to be directly mappable to their activating inputs. The authors report that these SNMF-derived features outperform sparse autoencoders (SAEs) and a difference-in-means baseline in causal steering tasks across Llama 3.1, Gemma 2, and GPT-2, while also aligning with human-interpretable concepts and revealing a hierarchical structure through reused neuron combinations.

Significance. Should the quantitative results and controlled comparisons hold, this work could provide a straightforward unsupervised technique for extracting causally effective features that are intrinsically linked to the model's neuron-level computations, potentially improving upon the limitations of SAEs in both performance and interpretability for mechanistic interpretability research.

major comments (2)
  1. [Abstract] The abstract states that SNMF-derived features outperform SAEs on causal steering but reports no quantitative results, no error bars, and no details on how steering strength was measured or how many features were compared. This absence makes it difficult to assess the practical significance of the claimed improvements and to verify the central experimental claim.
  2. [Abstract] SNMF is applied to MLP activations, while the abstract notes that SAEs are 'commonly trained over residual stream activations'. This difference in activation space is a confound: the superior causal steering performance could come from operating in the post-nonlinearity MLP space rather than from the SNMF method itself. A comparison with SAEs trained on the same MLP activations is necessary to isolate the contribution of the semi-nonnegative factorization.
minor comments (1)
  1. [Abstract] The abstract mentions that 'specific neuron combinations are reused across semantically-related features' but does not specify how this hierarchical structure was quantified or visualized in the experiments.
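One way the neuron reuse raised in this comment could be quantified is the overlap between the top neurons of each pair of features. The sketch below computes a Jaccard similarity matrix over top-`m` neuron sets. It is an assumed metric for illustration only: the names `top_neurons` and `jaccard_matrix` are hypothetical, and this page does not say which measure the paper uses.

```python
import numpy as np

def top_neurons(Z, m):
    """Set of the m largest-magnitude neuron indices for each feature (column of Z)."""
    order = np.argsort(-np.abs(Z), axis=0)[:m]        # shape (m, k)
    return [set(order[:, j].tolist()) for j in range(Z.shape[1])]

def jaccard_matrix(groups):
    """Pairwise Jaccard similarity between neuron sets."""
    k = len(groups)
    J = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            union = groups[i] | groups[j]
            shared = groups[i] & groups[j]
            J[i, j] = J[j, i] = len(shared) / len(union) if union else 0.0
    return J

# Toy feature matrix (assumption): 4 neurons, 3 features.
Z = np.array([
    [5.0, 5.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 4.0, 0.1],
    [0.0, 0.0, 9.0],
])
print(jaccard_matrix(top_neurons(Z, m=2)))
```

High off-diagonal values between semantically related features would be consistent with shared neuron groups, and clustering the matrix is one possible route to visualizing a hierarchy.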

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the experimental controls.

point-by-point responses
  1. Referee: [Abstract] The abstract states that SNMF-derived features outperform SAEs on causal steering but reports no quantitative results, no error bars, and no details on how steering strength was measured or how many features were compared. This absence makes it difficult to assess the practical significance of the claimed improvements and to verify the central experimental claim.

    Authors: We agree that the abstract would be strengthened by including specific quantitative details. In the revised version, we have updated the abstract to report key metrics from our causal steering experiments, including average improvements in steering strength (with standard errors) and the number of features used in each comparison. These values are taken directly from the results in Section 4 and are now summarized concisely within the abstract's length constraints. revision: yes

  2. Referee: [Abstract] SNMF is applied to MLP activations, while the abstract notes that SAEs are 'commonly trained over residual stream activations'. This difference in activation space is a confound: the superior causal steering performance could come from operating in the post-nonlinearity MLP space rather than from the SNMF method itself. A comparison with SAEs trained on the same MLP activations is necessary to isolate the contribution of the semi-nonnegative factorization.

    Authors: This is a fair point about isolating the method's contribution. While our main results compare against standard SAEs trained on residual stream activations (as is conventional), we have added new controlled experiments training SAEs directly on the same MLP activations. These results, now presented in Section 4.3 and Appendix D, show that SNMF features still outperform the matched SAEs on causal steering. We have also updated the abstract to note this additional control. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of factorization inputs

full rationale

The paper applies semi-nonnegative matrix factorization (SNMF) as an external unsupervised decomposition to MLP activations, producing sparse neuron combinations that are then evaluated on causal steering. The reported outperformance versus SAEs and difference-in-means baselines is obtained from separate experiments on Llama 3.1, Gemma 2, and GPT-2 rather than from any equation that equates the steering metric to the SNMF objective or to parameters fitted on the same data. No self-citation chain, uniqueness theorem, or ansatz is invoked to force the central claim; the factorization objective remains independent of the downstream causal metric. The activation-space difference (MLP vs. residual stream) is a methodological choice open to controlled comparison, not a definitional reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach assumes standard non-negativity constraints and sparsity penalties from matrix factorization literature; the number of components and the sparsity hyperparameter are free choices that must be set before the decomposition can be run.

free parameters (2)
  • number of SNMF components
    The rank of the factorization determines how many features are extracted and is chosen by the experimenter.
  • sparsity regularization strength
    Controls how many neurons participate in each feature and is tuned to produce interpretable groups.
axioms (1)
  • domain assumption: MLP activations can be usefully approximated as non-negative linear combinations of a small number of basis vectors
    Invoked when the authors choose semi-nonnegative matrix factorization as the decomposition tool.
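The two free parameters and the axiom can be located in a generic sparse semi-NMF objective. The block below is a sketch in assumed notation: the symbols A (activation matrix, neurons by tokens), Z (features), Y (coefficients), k (number of components), and λ (sparsity strength) do not appear on this page, and the L1 penalty is only one common way to encourage sparsity. The paper may enforce it differently.

```latex
\min_{Z \in \mathbb{R}^{d \times k},\; Y \in \mathbb{R}^{k \times n}_{\ge 0}}
\; \lVert A - Z Y \rVert_F^2 \;+\; \lambda \lVert Z \rVert_1
```

In this form, k fixes how many features are extracted, λ controls how many neurons take part in each feature, and the constraint Y ≥ 0 is the non-negativity assumption recorded in the axiom.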

pith-pipeline@v0.9.0 · 5784 in / 1377 out tokens · 25350 ms · 2026-05-19T09:09:57.991736+00:00 · methodology
