Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
Pith reviewed 2026-05-19 00:49 UTC · model grok-4.3
The pith
Neighbor distance minimization finds interpretable subspaces encoding consistent concepts in model representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By optimizing a projection to minimize distances between neighboring points, the method recovers non-basis-aligned subspaces in which the information encoded tends to correspond to the same abstract concept for different inputs, resembling variables in the model's internal circuits.
What carries the argument
Neighbor distance minimization (NDM), an unsupervised objective that learns subspaces by reducing distances among neighboring inputs in the projected representation.
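To make the objective concrete, here is a minimal sketch of how a neighbor-distance loss could be evaluated for a fixed candidate decomposition. It is an illustration under stated assumptions, not the paper's implementation: the names `project` and `ndm_loss`, the toy 3x3 grid, and the two hand-picked bases are all hypothetical, and the paper learns its projection by optimization rather than comparing fixed candidates.

```python
import math

def project(point, basis):
    """Coordinates of `point` along each (assumed orthonormal) vector in `basis`."""
    return [sum(p * b for p, b in zip(point, vec)) for vec in basis]

def ndm_loss(points, subspaces):
    """Mean distance from each point to its nearest neighbor, taken per subspace.

    `subspaces` is a list of bases; each basis is a list of orthonormal vectors.
    This is an assumed form of the objective, written for illustration only.
    """
    total, count = 0.0, 0
    for basis in subspaces:
        coords = [project(p, basis) for p in points]
        for i, ci in enumerate(coords):
            # Distance to the closest *other* point inside this subspace.
            nearest = min(math.dist(ci, cj) for j, cj in enumerate(coords) if j != i)
            total += nearest
            count += 1
    return total / count

# Toy data: a 3x3 grid, i.e. two independent "variables" along x and y.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]

# Candidate A: two 1-D subspaces aligned with the grid's own axes.
aligned = [[(1.0, 0.0)], [(0.0, 1.0)]]

# Candidate B: the same two subspaces rotated by 45 degrees.
s = 1.0 / math.sqrt(2.0)
rotated = [[(s, s)], [(-s, s)]]

loss_aligned = ndm_loss(grid, aligned)
loss_rotated = ndm_loss(grid, rotated)
```

In this toy case the aligned decomposition has zero loss, because every point shares its x (or y) value with another point, while the rotated one mixes the two variables and leaves some points without an exact neighbor. That is the intuition behind the claim that minimizing neighbor distances favors subspaces holding one variable each.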
If this is right
- Subspaces recovered match circuit variables in GPT-2.
- The approach works on models with 2 billion parameters.
- Distinct subspaces handle context routing versus parametric knowledge.
- Representation spaces contain natural, concept-specific organizations discoverable without labels.
Where Pith is reading between the lines
- If subspaces consistently encode single concepts, targeted interventions on them could modify specific model behaviors.
- This unsupervised decomposition might apply to other domains like vision or reinforcement learning models.
- Consistency across inputs suggests models maintain stable internal variables for abstract ideas.
Load-bearing premise
That a projection minimizing neighbor distances will isolate single-concept encodings consistently rather than arbitrary groupings.
What would settle it
The claim would be undermined if the obtained subspaces did not consistently encode the same abstract concept across inputs, or did not correspond to known circuit variables when tested on GPT-2.
Original abstract
Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these "natural" subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to "variables" used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Neighbor Distance Minimization (NDM), an unsupervised objective that learns non-basis-aligned subspaces in neural network representation spaces. It claims these subspaces are interpretable, encode consistent abstract concepts across inputs (functioning like circuit variables), show a strong quantitative connection to known circuits in GPT-2, and scale to separate context versus parametric-knowledge routing in 2B-parameter models.
Significance. If the central claim holds after addressing the noted gaps, the work would supply a scalable unsupervised tool for mechanistic interpretability that does not require labeled circuit knowledge. The reported scalability evidence to 2B models and the unsupervised framing are genuine strengths that could shift how subspaces are discovered in large models.
major comments (2)
- [Quantitative experiments on GPT-2] Quantitative experiments section: the claimed strong connection between NDM subspaces and GPT-2 circuit variables is consistent with either model-internal encoding or input-similarity clustering; an ablation that holds the input distribution fixed while altering internal computation (e.g., via activation patching or counterfactual inputs) is required to support the circuit-variable interpretation.
- [Scalability to 2B models] Scalability experiments: the separation of context and parametric-knowledge subspaces in the 2B model lacks reported quantitative metrics (e.g., mutual information with known routing heads or intervention success rates) and controls for input-driven effects, weakening the scalability claim.
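The ablation requested in the first major comment can be pictured as a subspace-level activation patch: keep the input fixed, but replace only the component of an activation that lies inside a candidate subspace with the component from a counterfactual run. The sketch below is a generic illustration of that idea, not the authors' procedure; `patch_subspace`, the three-dimensional vectors, and the one-dimensional basis are hypothetical, and the basis is assumed orthonormal.

```python
def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def patch_subspace(h_clean, h_counterfactual, basis):
    """Swap the part of h_clean inside span(basis) for the counterfactual part.

    Equivalent to h_clean - P h_clean + P h_counterfactual with P = V V^T,
    assuming the vectors in `basis` are orthonormal.
    """
    patched = list(h_clean)
    for vec in basis:
        # Difference between the two runs along this basis direction.
        delta = dot(h_counterfactual, vec) - dot(h_clean, vec)
        patched = [p + delta * b for p, b in zip(patched, vec)]
    return patched

h_clean = [1.0, 2.0, 3.0]       # hypothetical activation on the original input
h_cf = [10.0, 20.0, 30.0]       # hypothetical activation on a counterfactual input
basis = [[1.0, 0.0, 0.0]]       # hypothetical 1-D subspace

patched = patch_subspace(h_clean, h_cf, basis)
```

Here only the first coordinate changes, so any downstream behavioral difference could be attributed to the information carried in that subspace rather than to surface similarity between inputs.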
minor comments (2)
- The method description would benefit from an explicit statement of the NDM loss function and the precise definition of 'neighbor' (e.g., in embedding space or activation space) to allow reproduction.
- Figure captions should explicitly state the dimensionality of the learned subspaces and the visualization technique used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important distinctions between input-driven effects and model-internal representations as well as the need for stronger quantitative support. We address each major comment below and will incorporate revisions to strengthen the manuscript accordingly.
- Referee: Quantitative experiments section: the claimed strong connection between NDM subspaces and GPT-2 circuit variables is consistent with either model-internal encoding or input-similarity clustering; an ablation that holds the input distribution fixed while altering internal computation (e.g., via activation patching or counterfactual inputs) is required to support the circuit-variable interpretation.
  Authors: We appreciate the referee's observation that the current quantitative results on GPT-2 could be consistent with input-similarity clustering. Our experiments demonstrate alignment between NDM subspaces and known circuit variables, with subspaces encoding consistent abstract concepts across inputs. To more rigorously support the circuit-variable interpretation over input-driven alternatives, we will add ablations in the revised manuscript that hold the input distribution fixed. These will include activation patching and counterfactual input experiments on the known GPT-2 circuits, with results showing the effect on subspace recovery and alignment metrics.
  Revision: yes
- Referee: Scalability experiments: the separation of context and parametric-knowledge subspaces in the 2B model lacks reported quantitative metrics (e.g., mutual information with known routing heads or intervention success rates) and controls for input-driven effects, weakening the scalability claim.
  Authors: We acknowledge that the scalability section relies primarily on qualitative evidence of subspace separation for context versus parametric knowledge routing. To address this, the revised manuscript will include quantitative metrics such as mutual information with known routing heads and intervention success rates. We will also add controls that compare NDM subspaces against input-similarity clustering baselines to mitigate concerns about input-driven effects. These additions will provide a more robust foundation for the scalability claim.
  Revision: yes
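One of the metrics named in this exchange, mutual information, can be estimated for discrete labels with a simple plug-in formula. The sketch below is only an illustration of that general estimator; neither the report nor the rebuttal specifies how the metric would be computed, and `cluster_ids` and `routing` are made-up labels standing in for a discretized subspace readout and a known routing behavior.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical labels: which cluster a subspace projection falls into,
# and which knowledge source the model relied on for that input.
cluster_ids = [0, 0, 1, 1]
routing = ["context", "context", "parametric", "parametric"]

mi_dependent = mutual_information(cluster_ids, routing)     # labels agree perfectly
mi_independent = mutual_information([0, 1, 0, 1], routing)  # labels unrelated
```

With perfectly aligned labels the estimate is one bit, and with unrelated labels it is zero. A real evaluation would need many more samples, since plug-in estimates are biased upward on small data.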
Circularity Check
No significant circularity; unsupervised NDM derivation is self-contained
The paper presents NDM as an unsupervised objective that minimizes neighbor distances to decompose representation space into subspaces. Interpretability and correspondence to circuit variables are assessed after training via qualitative inspection and separate quantitative checks against known GPT-2 circuits. No equations or steps reduce the claimed subspaces to fitted parameters or self-citations by construction; the training objective does not incorporate circuit labels or prior results from the same authors in a load-bearing way. The central claim therefore rests on independent empirical outcomes rather than definitional equivalence to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Representation spaces contain organized subspaces encoding distinct abstract concepts that can be isolated without supervision.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  "Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner... minimizing the distance with the nearest neighbor in subspaces."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  "subspaces are as independent as possible... total correlation is reduced"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
  VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.