Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Michael Hahn; Xinting Huang

arXiv: 2508.01916 · v3 · pith:KJYOSSX4new · submitted 2025-08-03 · 💻 cs.LG · cs.AI · cs.CL

Xinting Huang, Michael Hahn

Pith reviewed 2026-05-19 00:49 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL

keywords: mechanistic interpretability · representation subspaces · unsupervised learning · neural network circuits · GPT-2 · model internals

The pith

Neighbor distance minimization finds interpretable subspaces encoding consistent concepts in model representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that representation spaces in neural networks can be broken down into subspaces each tied to a single abstract concept using only an unsupervised training objective. This is done by neighbor distance minimization, which projects the space so that inputs close in the original representation stay close in the subspace. A reader would care if this holds because it implies models use modular, variable-like structures internally that can be discovered automatically. Experiments link these subspaces to circuit variables in GPT-2 and show separation of context and knowledge in larger models up to 2 billion parameters. This provides a new unsupervised lens on how models organize their computations.

Core claim

By optimizing a projection to minimize distances between neighboring points, the method recovers non-basis-aligned subspaces in which the information encoded tends to correspond to the same abstract concept for different inputs, resembling variables in the model's internal circuits.

What carries the argument

Neighbor distance minimization (NDM), an unsupervised objective that learns subspaces by reducing distances among neighboring inputs in the projected representation.
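
To make the objective concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a 2-D representation built from two independent uniform factors mixed by a rotation, splits the space into two 1-D subspaces, and scores each candidate split by the summed mean nearest-neighbor distance inside the subspaces. The function names, the grid search over angles, and the synthetic data are all illustrative assumptions; this page does not specify how the paper parameterizes or optimizes the projection.

```python
import math
import random


def mean_nn_distance_1d(values):
    """Mean distance from each value to its nearest neighbor on a line."""
    xs = sorted(values)
    total = 0.0
    for i, x in enumerate(xs):
        left = x - xs[i - 1] if i > 0 else math.inf
        right = xs[i + 1] - x if i < len(xs) - 1 else math.inf
        total += min(left, right)
    return total / len(xs)


def neighbor_distance_objective(points, phi):
    """Sum over two 1-D subspaces (axes rotated by phi) of the mean NN distance."""
    c, s = math.cos(phi), math.sin(phi)
    first = [c * x + s * y for x, y in points]    # coordinate along rotated axis 1
    second = [-s * x + c * y for x, y in points]  # coordinate along rotated axis 2
    return mean_nn_distance_1d(first) + mean_nn_distance_1d(second)


def make_toy_data(n, theta, seed=0):
    """Two independent uniform factors, mixed by a rotation of angle theta (assumed setup)."""
    rng = random.Random(seed)
    c, s = math.cos(theta), math.sin(theta)
    data = []
    for _ in range(n):
        u, v = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append((c * u - s * v, s * u + c * v))
    return data


theta = 0.6  # hidden mixing angle of the toy data
points = make_toy_data(2000, theta)
# Candidate splits: 30 angles in [0, pi/2); the objective repeats with period pi/2.
grid = [k * (math.pi / 2) / 30 for k in range(30)]
best_phi = min(grid, key=lambda phi: neighbor_distance_objective(points, phi))
print(f"true angle {theta:.2f}, recovered angle {best_phi:.2f}")
```

In this toy setting the lowest-scoring angle should land near the hidden mixing angle, because a split that mixes the two factors spreads each 1-D projection over a wider range and so pushes neighbors apart. Whether the same intuition carries over to real activations is exactly what the paper's experiments are meant to test.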

If this is right

Subspaces recovered match circuit variables in GPT-2.
The approach works on models with 2 billion parameters.
Distinct subspaces handle context routing versus parametric knowledge.
Representation spaces contain natural, concept-specific organizations discoverable without labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If subspaces consistently encode single concepts, targeted interventions on them could modify specific model behaviors.
This unsupervised decomposition might apply to other domains like vision or reinforcement learning models.
Consistency across inputs suggests models maintain stable internal variables for abstract ideas.

Load-bearing premise

That a projection minimizing neighbor distances will isolate single-concept encodings consistently rather than arbitrary groupings.

What would settle it

If the obtained subspaces do not consistently encode the same abstract concept across inputs or do not correspond to known circuit variables when tested on GPT-2.

Original abstract

Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these "natural" subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to "variables" used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

NDM gives a workable unsupervised route to concept-like subspaces but the tie to actual model circuits rests on correlations that could have other explanations.

The main point is that neighbor distance minimization can produce subspaces that look interpretable and are somewhat aligned with known circuits in GPT-2, and that it scales at least to the 2B regime for separating context and knowledge routing. That is the concrete advance here. The method is new in how it uses a seemingly unrelated neighbor-distance objective to recover non-basis-aligned subspaces that hold consistent abstract information across inputs. The qualitative examples are straightforward, and the quantitative checks against GPT-2 circuits give the claim some grounding. Scaling evidence for larger models is also a plus, since most subspace work stays small.

The soft spot is exactly the one the stress test flags: nothing in the experiments holds the input distribution fixed while changing internal computation, so the subspaces could be recovering input-similarity clusters that happen to line up with circuits after the fact. The GPT-2 numbers are consistent with either story, and without those controls the causal link to model variables stays suggestive rather than tight. Minor gaps include missing ablations on projection dimensionality and missing or unclear error bars on the quantitative metrics.

This is for people already working on mechanistic interpretability who need fresh unsupervised decomposition tools. A reader who wants practical ideas for breaking down large activation spaces will get usable techniques even if the circuit interpretation needs tightening. I would send it to peer review; the core idea is solid enough to benefit from referee input on the experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes Neighbor Distance Minimization (NDM), an unsupervised objective that learns non-basis-aligned subspaces in neural network representation spaces. It claims these subspaces are interpretable, encode consistent abstract concepts across inputs (functioning like circuit variables), show a strong quantitative connection to known circuits in GPT-2, and scale to separate context versus parametric-knowledge routing in 2B-parameter models.

Significance. If the central claim holds after addressing the noted gaps, the work would supply a scalable unsupervised tool for mechanistic interpretability that does not require labeled circuit knowledge. The reported scalability evidence to 2B models and the unsupervised framing are genuine strengths that could shift how subspaces are discovered in large models.

major comments (2)

[Quantitative experiments on GPT-2] Quantitative experiments section: the claimed strong connection between NDM subspaces and GPT-2 circuit variables is consistent with either model-internal encoding or input-similarity clustering; an ablation that holds the input distribution fixed while altering internal computation (e.g., via activation patching or counterfactual inputs) is required to support the circuit-variable interpretation.
[Scalability to 2B models] Scalability experiments: the separation of context and parametric-knowledge subspaces in the 2B model lacks reported quantitative metrics (e.g., mutual information with known routing heads or intervention success rates) and controls for input-driven effects, weakening the scalability claim.
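
The kind of control requested in the first major comment can be illustrated with a subspace patching step: keep everything outside a candidate subspace fixed and swap in only the subspace component taken from a counterfactual input. The sketch below is an assumption-laden illustration, not the paper's procedure. The 3-D activations, the 1-D subspace, and the helper names `project` and `patch_subspace` are hypothetical, and the basis is assumed to be orthonormal.

```python
def project(basis, vec):
    """Orthogonal projection of vec onto span(basis); basis must be orthonormal."""
    out = [0.0] * len(vec)
    for b in basis:
        coeff = sum(bi * vi for bi, vi in zip(b, vec))
        for i, bi in enumerate(b):
            out[i] += coeff * bi
    return out


def patch_subspace(clean, counterfactual, basis):
    """Replace the subspace component of `clean` with that of `counterfactual`."""
    p_clean = project(basis, clean)
    p_cf = project(basis, counterfactual)
    # Remove the clean component inside the subspace, add the counterfactual one.
    return [h - pc + pf for h, pc, pf in zip(clean, p_clean, p_cf)]


# Hypothetical 3-D activations and a 1-D subspace along (1, 1, 0) / sqrt(2).
r = 2 ** -0.5
basis = [[r, r, 0.0]]
clean = [1.0, 3.0, 5.0]
counterfactual = [6.0, 0.0, -2.0]
patched = patch_subspace(clean, counterfactual, basis)
print(patched)  # approximately [2.0, 4.0, 5.0]
```

In a real experiment the patched activation would be written back into the forward pass and the change in model output measured. If the subspace really carries a circuit variable, the output should follow the counterfactual value of that variable while everything else stays as in the clean run.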

minor comments (2)

The method description would benefit from an explicit statement of the NDM loss function and the precise definition of 'neighbor' (e.g., in embedding space or activation space) to allow reproduction.
Figure captions should explicitly state the dimensionality of the learned subspaces and the visualization technique used.
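
For orientation, one plausible way to write such a loss is sketched below. This is an assumed form reconstructed from the verbal description on this page, not the paper's stated equation. The symbols are hypothetical: $h_1, \dots, h_N$ are activation vectors, $R$ is a learned orthogonal matrix, $P_k$ selects the coordinates assigned to the $k$-th of $K$ subspaces, and "neighbor" is taken to mean the nearest other sample within the projected subspace.

```latex
\mathcal{L}(R) \;=\; \sum_{k=1}^{K} \frac{1}{N} \sum_{i=1}^{N}
  \min_{j \neq i} \bigl\lVert P_k R\, h_i - P_k R\, h_j \bigr\rVert_2 ,
\qquad R^{\top} R = I .
```

An explicit statement in the paper would also need to settle details this sketch leaves open, such as the choice of norm, how subspace dimensions are assigned, and whether neighbors are searched over a batch or the full dataset.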

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important distinctions between input-driven effects and model-internal representations as well as the need for stronger quantitative support. We address each major comment below and will incorporate revisions to strengthen the manuscript accordingly.

Referee: Quantitative experiments section: the claimed strong connection between NDM subspaces and GPT-2 circuit variables is consistent with either model-internal encoding or input-similarity clustering; an ablation that holds the input distribution fixed while altering internal computation (e.g., via activation patching or counterfactual inputs) is required to support the circuit-variable interpretation.

Authors: We appreciate the referee's observation that the current quantitative results on GPT-2 could be consistent with input-similarity clustering. Our experiments demonstrate alignment between NDM subspaces and known circuit variables, with subspaces encoding consistent abstract concepts across inputs. To more rigorously support the circuit-variable interpretation over input-driven alternatives, we will add ablations in the revised manuscript that hold the input distribution fixed. These will include activation patching and counterfactual input experiments on the known GPT-2 circuits, with results showing the effect on subspace recovery and alignment metrics.

revision: yes

Referee: Scalability experiments: the separation of context and parametric-knowledge subspaces in the 2B model lacks reported quantitative metrics (e.g., mutual information with known routing heads or intervention success rates) and controls for input-driven effects, weakening the scalability claim.

Authors: We acknowledge that the scalability section relies primarily on qualitative evidence of subspace separation for context versus parametric knowledge routing. To address this, the revised manuscript will include quantitative metrics such as mutual information with known routing heads and intervention success rates. We will also add controls that compare NDM subspaces against input-similarity clustering baselines to mitigate concerns about input-driven effects. These additions will provide a more robust foundation for the scalability claim.

revision: yes
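
As an illustration of the mutual-information metric mentioned in this exchange, the sketch below computes a plug-in estimate between discrete cluster assignments and a discrete routing label. It is a hedged example under stated assumptions: the labels, the cluster ids, and the function name `mutual_information_bits` are hypothetical, and real activations would first need to be discretized (for example by clustering within each subspace).

```python
import math
from collections import Counter


def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


# Hypothetical labels: which routing behavior each prompt exercises.
route = ["context", "context", "parametric", "parametric"] * 25
# Hypothetical cluster ids from a subspace that tracks the route exactly...
aligned = [0 if r == "context" else 1 for r in route]
# ...and from a baseline whose clusters ignore the route.
unrelated = [0, 1, 0, 1] * 25

print(mutual_information_bits(aligned, route))    # 1.0
print(mutual_information_bits(unrelated, route))  # 0.0
```

Comparing the score of an NDM subspace with the score of an input-similarity clustering baseline on the same prompts is one way to make the promised control quantitative. The plug-in estimator is biased upward on small samples, so any real comparison should also report a shuffled-label baseline.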

Circularity Check

0 steps flagged

No significant circularity; unsupervised NDM derivation is self-contained

The paper presents NDM as an unsupervised objective that minimizes neighbor distances to decompose representation space into subspaces. Interpretability and correspondence to circuit variables are assessed after training via qualitative inspection and separate quantitative checks against known GPT-2 circuits. No equations or steps reduce the claimed subspaces to fitted parameters or self-citations by construction; the training objective does not incorporate circuit labels or prior results from the same authors in a load-bearing way. The central claim therefore rests on independent empirical outcomes rather than definitional equivalence to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The core assumption that natural subspaces exist and can be recovered unsupervised is treated as a domain assumption.

axioms (1)

domain assumption: Representation spaces contain organized subspaces encoding distinct abstract concepts that can be isolated without supervision.
Implicit in the claim that NDM finds 'natural' subspaces similar to model variables.

pith-pipeline@v0.9.0 · 5708 in / 1226 out tokens · 25939 ms · 2026-05-19T00:49:49.908428+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner... minimizing the distance with the nearest neighbor in subspaces."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "subspaces are as independent as possible... total correlation is reduced"
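
The second quoted passage refers to total correlation, which is the sum of the marginal entropies of the parts minus the joint entropy, and which is zero exactly when the parts are independent. The sketch below works this out for a 2-D Gaussian, which is an assumption chosen because the quantity has a closed form there; the paper's data are not Gaussian and the function names are illustrative.

```python
import math


def rotated_covariance(var_a, var_b, angle):
    """Covariance of R(angle) z, where z has independent coordinates with the given variances."""
    c, s = math.cos(angle), math.sin(angle)
    s11 = c * c * var_a + s * s * var_b
    s22 = s * s * var_a + c * c * var_b
    s12 = c * s * (var_a - var_b)
    return s11, s22, s12


def gaussian_total_correlation(s11, s22, s12):
    """Total correlation (nats) between the two coordinates of a 2-D Gaussian."""
    det = s11 * s22 - s12 * s12
    # h(x) + h(y) - h(x, y) reduces to 0.5 * log(s11 * s22 / det) for Gaussians.
    return 0.5 * math.log(s11 * s22 / det)


aligned_tc = gaussian_total_correlation(*rotated_covariance(4.0, 1.0, 0.0))
mixed_tc = gaussian_total_correlation(*rotated_covariance(4.0, 1.0, math.pi / 4))
print(f"aligned split: {aligned_tc:.4f} nats, 45-degree split: {mixed_tc:.4f} nats")
```

The split aligned with the independent factors has zero total correlation, while the 45-degree split gives about 0.2231 nats. Because an orthogonal change of basis leaves the joint entropy unchanged, lowering the summed entropies of the subspaces is equivalent to lowering total correlation, which is one way to read the quoted passage.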

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 6.0

VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.