Constructing Interpretable Features from Compositional Neuron Groups
Pith reviewed 2026-05-19 09:09 UTC · model grok-4.3
The pith
Decomposing MLP activations with SNMF produces interpretable features that steer language models more effectively than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying semi-nonnegative matrix factorization to MLP activations learns features that are sparse linear combinations of co-activated neurons. Each such feature can be directly linked to the input tokens that cause the relevant neurons to fire. This yields directions that both align with human-interpretable concepts and exert stronger causal influence on model outputs than those from sparse autoencoders or difference-in-means baselines. The analysis further shows that the same neuron combinations appear in multiple related features, pointing to a compositional hierarchy within the activation space.
What carries the argument
Semi-nonnegative matrix factorization applied to MLP activation matrices, which yields each feature as a sparse linear combination of co-activated neurons while preserving a direct map to the inputs that trigger them.
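To make the factorization concrete, here is a minimal NumPy sketch of generic semi-NMF using the standard multiplicative updates (Ding, Li and Jordan). It is an illustration under stated assumptions, not the paper's implementation: the names `semi_nmf` and `rel_error`, the layout of `A` as neurons by tokens, the choice to leave `Z` unconstrained while keeping `Y` nonnegative, and the toy data are all assumptions, and the sparsity step that the summary mentions is left out.

```python
import numpy as np

def _pos(M):
    # Elementwise positive part of a matrix.
    return (np.abs(M) + M) / 2.0

def _neg(M):
    # Elementwise negative part (returned as nonnegative values).
    return (np.abs(M) - M) / 2.0

def semi_nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate A (d x n, mixed sign) as Z @ Y with Y >= 0.

    Z (d x k): one column per feature, weights over neurons (unconstrained).
    Y (k x n): nonnegative coefficient of each feature on each input token.
    """
    rng = np.random.default_rng(seed)
    Y = rng.random((k, A.shape[1])) + 0.1  # strictly positive start
    for _ in range(n_iter):
        # Least-squares update of Z for the current Y.
        Z = A @ Y.T @ np.linalg.pinv(Y @ Y.T)
        ZtA, ZtZ = Z.T @ A, Z.T @ Z
        # Multiplicative update; it cannot make Y negative.
        num = _pos(ZtA) + _neg(ZtZ) @ Y
        den = _neg(ZtA) + _pos(ZtZ) @ Y + eps
        Y *= np.sqrt(num / den)
    return Z, Y

def rel_error(A, Z, Y):
    # Relative Frobenius reconstruction error.
    return np.linalg.norm(A - Z @ Y) / np.linalg.norm(A)

# Toy stand-in for an MLP activation matrix: 20 neurons, 100 tokens, rank 3.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3)) @ rng.random((3, 100))
Z1, Y1 = semi_nmf(A, k=3, n_iter=1)
Z, Y = semi_nmf(A, k=3, n_iter=200)
```

In this layout, column `Z[:, j]` is the neuron-space direction for feature `j`, and row `Y[j]` records how strongly each input token activates it, which is what allows a feature to be traced back to its inputs.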
If this is right
- Steering interventions on these features change model behavior more reliably than interventions on SAE directions or difference-in-means vectors.
- The same small sets of neurons are reused across multiple semantically related features, indicating compositional reuse in concept formation.
- The activation space of MLP layers contains a detectable hierarchical structure built from shared neuron groups.
- Each feature comes with an explicit list of input tokens that activate it, allowing direct verification of its meaning without additional post-hoc labeling.
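The implications above can be pictured with three small operations on a factorization `A ≈ Z @ Y`. The sketch below is a toy illustration only: the helper names `steer`, `top_tokens`, and `neuron_overlap`, the Jaccard measure over top-weighted neurons, and all numbers are assumptions made for this example, not the paper's evaluation protocol.

```python
import numpy as np

def steer(activations, feature, alpha):
    """Add alpha times the unit-norm feature direction to each row of activations (n x d)."""
    unit = feature / np.linalg.norm(feature)
    return activations + alpha * unit

def top_tokens(coeffs, tokens, feature_idx, top_n=3):
    """Tokens with the largest coefficient for one feature; coeffs is (k x n)."""
    order = np.argsort(coeffs[feature_idx])[::-1][:top_n]
    return [tokens[i] for i in order]

def neuron_overlap(Z, i, j, top_m=2):
    """Jaccard overlap between the top_m highest-|weight| neurons of features i and j."""
    top_i = set(np.argsort(-np.abs(Z[:, i]))[:top_m].tolist())
    top_j = set(np.argsort(-np.abs(Z[:, j]))[:top_m].tolist())
    return len(top_i & top_j) / len(top_i | top_j)

# Hypothetical factors: 4 neurons, 3 features, 5 tokens.
tokens = ["Mon", "cat", "Tue", "dog", "Fri"]
Z = np.array([[1.0, 0.9, 0.0],
              [0.8, 0.7, 0.0],
              [0.0, 0.1, 1.0],
              [0.0, 0.0, 0.9]])
Y = np.array([[0.9, 0.0, 0.8, 0.1, 0.7],
              [0.0, 0.9, 0.1, 0.8, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0]])

labels = top_tokens(Y, tokens, feature_idx=0)   # ['Mon', 'Tue', 'Fri']
shared = neuron_overlap(Z, 0, 1)                # 1.0: same top neurons
disjoint = neuron_overlap(Z, 0, 2)              # 0.0: no shared top neurons
steered = steer(np.zeros((2, 4)), Z[:, 0], alpha=4.0)
```

Normalizing the direction in `steer` makes `alpha` comparable across features, which matters when steering strength is compared between methods.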
Where Pith is reading between the lines
- The same factorization could be tried on attention activations to test whether similar compositional groups exist outside the feed-forward path.
- If the reuse of neuron groups holds across models, it would suggest that language models converge on a shared vocabulary of low-level computational building blocks.
- Extending the method to track how these groups combine at different layers might reveal how simple features assemble into more abstract concepts during forward passes.
Load-bearing premise
That the co-activation statistics captured by the factorization correspond to the actual computational groupings the model uses to represent and process concepts.
What would settle it
A controlled test in which adding or subtracting the SNMF-derived features produces no reliable change in the model's output probabilities for the corresponding concepts, or where the features fail to match consistent human labels across held-out examples.
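A test of this kind could be scored as the change in log-probability of concept tokens when a feature is added or subtracted. The following sketch uses a toy linear readout `W` in place of a real language model; `steering_shift`, `concept_logprob`, and every value below are hypothetical and only show the shape of the measurement.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over a 1-D logit vector.
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def concept_logprob(h, W, concept_ids):
    """Log of the total probability assigned to the concept's token ids."""
    logp = log_softmax(W @ h)
    return np.log(np.exp(logp[concept_ids]).sum())

def steering_shift(h, W, feature, concept_ids, alpha):
    """Change in concept log-probability after adding alpha * unit feature to h."""
    unit = feature / np.linalg.norm(feature)
    before = concept_logprob(h, W, concept_ids)
    after = concept_logprob(h + alpha * unit, W, concept_ids)
    return after - before

# Toy readout: 4 vocabulary items, 3 hidden dimensions (assumed values).
W = np.array([[2.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
h = np.zeros(3)
feature = np.array([1.0, 0.0, 0.0])

shift_add = steering_shift(h, W, feature, [0], alpha=2.0)
shift_sub = steering_shift(h, W, feature, [0], alpha=-2.0)
shift_null = steering_shift(h, W, feature, [0], alpha=0.0)
```

The claim would be in trouble if shifts like `shift_add` and `shift_sub` were indistinguishable from `shift_null` across many features and prompts; in this toy setup they move in opposite directions by construction.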
Original abstract
A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.
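For readers unfamiliar with the supervised baseline named in the abstract, difference-in-means is usually computed as the mean activation on examples that express a concept minus the mean on examples that do not. The sketch below assumes that common formulation; the function name `diff_in_means` and the toy activations are illustrative, and the abstract does not give the paper's exact setup.

```python
import numpy as np

def diff_in_means(pos_acts, neg_acts):
    """Direction from labeled examples: mean(positive) - mean(negative).

    pos_acts, neg_acts: arrays of shape (num_examples, d).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def cosine(u, v):
    # Cosine similarity between two direction vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy activations for two labeled groups (assumed values).
pos_acts = np.array([[1.0, 2.0], [3.0, 2.0]])
neg_acts = np.array([[0.0, 1.0], [0.0, 1.0]])

direction = diff_in_means(pos_acts, neg_acts)          # [2.0, 1.0]
similarity = cosine(direction, np.array([2.0, 1.0]))   # 1.0
```

The baseline needs labeled positive and negative examples for each concept, which is why it counts as supervised, whereas a factorization such as SNMF is fit without labels.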
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a method using semi-nonnegative matrix factorization (SNMF) on MLP activations to identify interpretable features as sparse linear combinations of co-activated neurons. These features are designed to be directly mappable to their activating inputs. The authors report that these SNMF-derived features outperform sparse autoencoders (SAEs) and a difference-in-means baseline in causal steering tasks across Llama 3.1, Gemma 2, and GPT-2, while also aligning with human-interpretable concepts and revealing a hierarchical structure through reused neuron combinations.
Significance. Should the quantitative results and controlled comparisons hold, this work could provide a straightforward unsupervised technique for extracting causally effective features that are intrinsically linked to the model's neuron-level computations, potentially improving upon the limitations of SAEs in both performance and interpretability for mechanistic interpretability research.
major comments (2)
- [Abstract] The abstract states that SNMF-derived features outperform SAEs on causal steering but gives no quantitative numbers, no error bars, and no details on how steering strength was measured or how many features were compared. This absence makes it difficult to assess the practical significance of the claimed improvements and to verify the central experimental claim.
- [Abstract] SNMF is applied to MLP activations, while the abstract notes that SAEs are 'commonly trained over residual stream activations'. This difference in activation space is a confound: the superior causal steering could come from operating in the post-nonlinearity MLP space rather than from the SNMF method itself. A comparison with SAEs trained on the same MLP activations is necessary to isolate the contribution of the semi-nonnegative factorization.
minor comments (1)
- [Abstract] The abstract mentions that 'specific neuron combinations are reused across semantically-related features' but does not specify how this hierarchical structure was quantified or visualized in the experiments.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the experimental controls.
- Referee: [Abstract] The abstract states that SNMF derived features outperform SAEs on causal steering but provides no quantitative numbers, error bars, or details on measurement of steering strength and number of features compared. This absence makes it difficult to assess the practical significance of the claimed improvements and undermines the ability to verify the central experimental claim.
  Authors: We agree that the abstract would be strengthened by including specific quantitative details. In the revised version, we have updated the abstract to report key metrics from our causal steering experiments, including average improvements in steering strength (with standard errors) and the number of features used in each comparison. These values are taken directly from the results in Section 4 and are now summarized concisely within the abstract's length constraints.
  Revision: yes
- Referee: [Abstract] SNMF is applied to MLP activations while the abstract notes that SAEs are 'commonly trained over residual stream activations'. This difference in activation space confounds the attribution of superior causal steering performance to the SNMF method itself rather than to operating in the post-nonlinearity MLP space. A comparison with SAEs trained on the same MLP activations is necessary to isolate the contribution of the semi-nonnegative factorization.
  Authors: This is a fair point about isolating the method's contribution. While our main results compare against standard SAEs trained on residual stream activations (as is conventional), we have added new controlled experiments training SAEs directly on the same MLP activations. These results, now presented in Section 4.3 and Appendix D, show that SNMF features still outperform the matched SAEs on causal steering. We have also updated the abstract to note this additional control.
  Revision: yes
Circularity Check
No significant circularity; empirical results independent of factorization inputs
The paper applies semi-nonnegative matrix factorization (SNMF) as an external unsupervised decomposition to MLP activations, producing sparse neuron combinations that are then evaluated on causal steering. The reported outperformance versus SAEs and difference-in-means baselines is obtained from separate experiments on Llama 3.1, Gemma 2, and GPT-2 rather than from any equation that equates the steering metric to the SNMF objective or to parameters fitted on the same data. No self-citation chain, uniqueness theorem, or ansatz is invoked to force the central claim; the factorization objective remains independent of the downstream causal metric. The activation-space difference (MLP vs. residual stream) is a methodological choice open to controlled comparison, not a definitional reduction.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of SNMF components
- sparsity regularization strength
axioms (1)
- domain assumption: MLP activations can be usefully approximated as non-negative linear combinations of a small number of basis vectors
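The ledger can be read against a generic semi-NMF objective. The formulation below is a sketch rather than the paper's stated loss: the symbols $A$, $Z$, $Y$, $k$, $\lambda$, and the penalty $\Omega$ are notation introduced here, and the ledger does not say which factor the sparsity term acts on or what form it takes.

```latex
\[
A \approx Z Y, \qquad
A \in \mathbb{R}^{d \times n},\quad
Z \in \mathbb{R}^{d \times k},\quad
Y \in \mathbb{R}_{\ge 0}^{k \times n}
\]
\[
\min_{Z,\; Y \ge 0} \;\; \lVert A - Z Y \rVert_F^{2} \;+\; \lambda\, \Omega(Z, Y)
\]
```

Under this reading, $k$ is the number of SNMF components and $\lambda$ is the sparsity regularization strength, which are the two free parameters listed. The constraint $Y \ge 0$ expresses the domain assumption that each activation vector (a column of $A$) is a non-negative combination of the $k$ basis vectors in $Z$.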