StructLens: A Structural Lens for Language Models via Maximum Spanning Trees
Pith reviewed 2026-05-21 13:39 UTC · model grok-4.3
The pith
Language models organize token representations into tree structures that are strongest in middle layers and develop from small to large units during pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StructLens constructs maximum spanning trees from semantic representations in residual streams, inspired by dependency parsing trees, to summarize token relationships. Analysis of these trees demonstrates that middle layers exhibit the strongest local-span organization where contiguous tokens remain nearby in representation space. Examination of pre-training checkpoints further shows smaller local units become detectable earlier while larger units appear at later training stages.
What carries the argument
Maximum spanning trees built from token similarities in residual stream representations, serving as summaries of structural token relationships across layers.
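The tree construction can be sketched as follows. This is a minimal illustration, not the released StructLens code: it assumes the reciprocal-distance similarity g(h_i, h_j) = 1 / (1 + ||h_i - h_j||) quoted from the paper further down this page, uses Prim's algorithm as one possible choice, and runs on invented 2-D toy vectors in place of real residual-stream states. The function names are hypothetical.

```python
import math


def reciprocal_similarity(u, v):
    """Similarity 1 / (1 + Euclidean distance), the form quoted from the paper."""
    return 1.0 / (1.0 + math.dist(u, v))


def maximum_spanning_tree(vectors, similarity=reciprocal_similarity):
    """Prim's algorithm on the complete similarity graph.

    Returns the tree as a sorted list of (i, j) token-index pairs with i < j.
    """
    n = len(vectors)
    if n < 2:
        return []
    in_tree = [False] * n
    best_sim = [-math.inf] * n  # best similarity from each outside node to the tree
    best_link = [-1] * n        # tree node that achieves best_sim
    in_tree[0] = True
    for j in range(1, n):
        best_sim[j] = similarity(vectors[0], vectors[j])
        best_link[j] = 0
    edges = []
    for _ in range(n - 1):
        # Attach the outside node most similar to the tree (ties: lowest index).
        k = max((j for j in range(n) if not in_tree[j]),
                key=lambda j: (best_sim[j], -j))
        edges.append((min(k, best_link[k]), max(k, best_link[k])))
        in_tree[k] = True
        for j in range(n):
            if not in_tree[j]:
                s = similarity(vectors[k], vectors[j])
                if s > best_sim[j]:
                    best_sim[j] = s
                    best_link[j] = k
    return sorted(edges)


# Invented toy "hidden states": tokens 0-2 form one cluster, tokens 3-4 another.
toy = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
tree = maximum_spanning_tree(toy)
```

On this toy input every tree edge links neighbouring token positions, which is the kind of local-span organization the paper measures. Because the similarity is a decreasing function of distance, the maximum spanning tree here coincides with the minimum spanning tree under Euclidean distance.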
If this is right
- Local token spans organize most coherently in middle layers rather than early or final ones.
- Pre-training proceeds by first detecting small local units before assembling larger structures.
- Representation organization can be tracked holistically without relying solely on attention patterns.
- Models exhibit progressive structural development that mirrors aspects of language acquisition.
Where Pith is reading between the lines
- The same tree-construction approach could be applied to compare structural maturation across model scales or architectures.
- These trees may align with human-annotated dependency or semantic parses, providing a bridge to linguistic theory.
- If the layer and training patterns hold, interventions that strengthen middle-layer trees could improve downstream parsing or generation tasks.
Load-bearing premise
That similarity between residual stream representations directly encodes meaningful semantic or structural token relationships rather than superficial metric artifacts.
What would settle it
Running an independent clustering or parsing evaluation on the same token sets and checking whether the maximum spanning trees recover gold syntactic or semantic groupings at rates significantly above random baselines.
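One way such a check could be scored is the share of gold edges that the spanning tree recovers, ignoring edge direction. The sketch below is an assumption about how the comparison might be set up, not a procedure taken from the paper; the function name and both edge lists are invented for illustration.

```python
def undirected_edge_overlap(predicted_edges, gold_edges):
    """Fraction of gold edges (direction ignored) that also appear in the predicted tree."""
    def normalize(edges):
        return {tuple(sorted(edge)) for edge in edges}

    gold = normalize(gold_edges)
    if not gold:
        raise ValueError("gold_edges must not be empty")
    return len(normalize(predicted_edges) & gold) / len(gold)


# Hypothetical 5-token sentence: gold edges are (head, dependent) pairs,
# predicted edges come from a spanning tree over the same tokens.
gold = [(1, 0), (1, 2), (4, 3), (1, 4)]
predicted = [(0, 1), (1, 2), (2, 3), (3, 4)]
score = undirected_edge_overlap(predicted, gold)
```

A score like this only becomes evidence once it is compared with the same score computed for random trees over the same tokens.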
Original abstract
Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest their own internal structures as well. While interpretability research has investigated how models compute representations mechanistically through attention patterns and Sparse AutoEncoders, the organization of the resulting representations is overlooked. To address this gap, we introduce StructLens, a framework to analyze representations through a holistic structural view. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, inspired by tree representation in dependency parsing, and provides summaries of token relationships in representation space. We analyze how contiguous tokens are also nearby in representation space and find that middle layers show the strongest local-span organization. Moreover, analysis of pre-training checkpoints reveals that smaller local units become detectable earlier in pre-training, and larger units later. Our findings demonstrate that StructLens provides insights into how models organize token representations across layers and training. Our code is available at https://github.com/naist-nlp/structlens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StructLens, a framework that builds maximum spanning trees from pairwise similarities between token representations in language model residual streams to summarize token relationships. It reports observational findings that contiguous tokens are more likely to be directly connected in these trees, with the strongest local-span organization appearing in middle layers, and that during pre-training smaller contiguous units become detectable earlier while larger units appear later. The approach is motivated by dependency parsing and is offered as a holistic complement to attention or SAE-based interpretability, with code released at the provided GitHub link.
Significance. If the reported layer-wise and checkpoint-wise patterns survive appropriate geometric controls, the work supplies a simple, tree-based descriptive lens for tracking how local structure organizes in representation space. The public code release is a concrete strength that would allow direct replication and extension to other models or training regimes.
major comments (2)
- [Method and §4.1] Method and §4.1 (layer-wise contiguous-token analysis): the claim that middle layers exhibit the strongest local-span organization is based on the fraction of MST edges connecting contiguous tokens, yet no baseline is reported that compares against token-permuted residual streams at the same layer or against isotropic random vectors drawn from the observed per-layer norm distribution. Without these controls the increase in contiguous edges could be an artifact of high-dimensional concentration or norm growth rather than evidence of learned linguistic structure; this directly affects the central layer-wise claim.
- [§5] §5 (pre-training checkpoint analysis): the finding that smaller local units appear earlier and larger units later likewise lacks the same permuted-token or random-vector baselines at each checkpoint. Because the MST is constructed from raw residual similarities, the temporal ordering of unit sizes could reflect changing norm or dimensionality statistics rather than the emergence of linguistic structure; this is load-bearing for the training-dynamics conclusion.
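The metric and the token-permuted control named in these comments can be sketched as below. The sketch rests on one observation, stated here as an assumption: if the tree is unique, shuffling which vector sits at which position leaves the tree over the vectors unchanged and only relabels its nodes, so the control can be computed by relabelling edge endpoints. The edge list is an invented 8-token tree and the function names are hypothetical.

```python
import random


def contiguous_edge_fraction(edges):
    """Share of tree edges that link tokens at adjacent positions (|i - j| == 1)."""
    if not edges:
        raise ValueError("edges must not be empty")
    return sum(1 for i, j in edges if abs(i - j) == 1) / len(edges)


def permuted_baseline(edges, n_tokens, n_samples=1000, seed=0):
    """Mean contiguous-edge fraction after randomly reassigning token positions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        new_position = list(range(n_tokens))
        rng.shuffle(new_position)
        relabelled = [(new_position[i], new_position[j]) for i, j in edges]
        total += contiguous_edge_fraction(relabelled)
    return total / n_samples


# Invented spanning tree over 8 token positions (7 edges, 5 of them contiguous).
edges = [(0, 1), (1, 2), (2, 3), (3, 5), (4, 5), (5, 7), (6, 7)]
observed = contiguous_edge_fraction(edges)
baseline = permuted_baseline(edges, n_tokens=8)
```

Under a uniformly random relabelling, each edge lands on an adjacent pair with probability (n - 1) / C(n, 2) = 2 / n, so the expected control value for n = 8 is 0.25. A layer-wise claim would need the observed fraction to exceed this by a margin that varies across layers.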
minor comments (2)
- [Abstract] The abstract states that 'middle layers show the strongest local-span organization' without defining the quantitative metric (e.g., fraction of contiguous edges, average span length) or reporting error bars across models or seeds.
- [Method] Notation for the similarity measure and the precise MST algorithm (Kruskal vs. Prim, handling of ties) is not stated explicitly in the method section, making exact reproduction dependent on inspecting the released code.
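Why tie handling matters for reproduction can be shown with a small Kruskal-style sketch. Which algorithm and tie rule the authors use is not stated on this page, so the ordering below (higher similarity first, then smaller index pair) is purely an illustrative assumption, and the function name is hypothetical.

```python
def kruskal_maximum_spanning_tree(n_nodes, weighted_edges):
    """Kruskal's algorithm for a maximum spanning tree.

    weighted_edges: iterable of (i, j, similarity).
    Ties are broken deterministically by the smaller (i, j) pair.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ordered = sorted(
        ((min(i, j), max(i, j), w) for i, j, w in weighted_edges),
        key=lambda edge: (-edge[2], edge[0], edge[1]),
    )
    tree = []
    for i, j, _ in ordered:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((i, j))
    return sorted(tree)


# Three tokens that are all equally similar: any two edges form a valid tree.
tied = [(0, 1, 0.5), (0, 2, 0.5), (1, 2, 0.5)]
tree = kruskal_maximum_spanning_tree(3, tied)
```

With an explicit tie rule the result no longer depends on the order in which edges are supplied. Without one, two correct implementations can return different trees on tied inputs, which would change any edge-based statistic.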
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. The points raised about the need for geometric controls are important for strengthening the interpretation of our observational findings. We address each major comment below and describe the planned revisions.
Point-by-point responses
- Referee: [Method and §4.1] Method and §4.1 (layer-wise contiguous-token analysis): the claim that middle layers exhibit the strongest local-span organization is based on the fraction of MST edges connecting contiguous tokens, yet no baseline is reported that compares against token-permuted residual streams at the same layer or against isotropic random vectors drawn from the observed per-layer norm distribution. Without these controls the increase in contiguous edges could be an artifact of high-dimensional concentration or norm growth rather than evidence of learned linguistic structure; this directly affects the central layer-wise claim.
  Authors: We agree that the reported increase in contiguous-token edges in middle layers could potentially arise from geometric properties of the residual streams rather than learned structure. In the revised manuscript we will add the suggested controls: MSTs computed on token-permuted versions of the residual streams at each layer, and MSTs on isotropic random vectors sampled to match the observed per-layer norm distribution. These baseline results will be presented alongside the original analysis in an updated §4.1, with explicit discussion of how they affect the layer-wise claims.
  Revision: yes
- Referee: [§5] §5 (pre-training checkpoint analysis): the finding that smaller local units appear earlier and larger units later likewise lacks the same permuted-token or random-vector baselines at each checkpoint. Because the MST is constructed from raw residual similarities, the temporal ordering of unit sizes could reflect changing norm or dimensionality statistics rather than the emergence of linguistic structure; this is load-bearing for the training-dynamics conclusion.
  Authors: We concur that the same controls are required for the pre-training dynamics analysis to rule out confounds from evolving representation statistics. We will extend §5 to include, at each checkpoint, both token-permuted residual streams and random vectors matched to the checkpoint-specific norm distribution. The revised section will report these baselines and reassess whether the observed progression from smaller to larger contiguous units survives the controls.
  Revision: yes
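The random-vector control promised in these responses could be generated along the following lines. This is a sketch under one stated assumption: "matched to the norm distribution" is read as reusing each observed vector's norm with a fresh isotropic direction, which is only one of several ways to match the distribution. The function name and the toy vectors are invented.

```python
import math
import random


def norm_matched_random_vectors(vectors, seed=0):
    """Replace each vector with an isotropic random direction of the same norm."""
    rng = random.Random(seed)
    controls = []
    for vector in vectors:
        target_norm = math.sqrt(sum(x * x for x in vector))
        # A standard Gaussian sample has a uniformly distributed direction.
        gaussian = [rng.gauss(0.0, 1.0) for _ in vector]
        gaussian_norm = math.sqrt(sum(x * x for x in gaussian))
        controls.append([x * target_norm / gaussian_norm for x in gaussian])
    return controls


# Invented stand-ins for residual-stream vectors at one layer or checkpoint.
observed_vectors = [[3.0, 4.0], [0.0, 2.0], [1.0, 1.0]]
control_vectors = norm_matched_random_vectors(observed_vectors)
```

Trees built on such control vectors keep the norm statistics of the layer or checkpoint but carry no learned directional structure, so any contiguity pattern that survives in them would point to a geometric artifact.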
Circularity Check
No circularity: purely descriptive framework with independent empirical observations
Full rationale
The paper introduces StructLens to construct maximum spanning trees from pairwise similarities in residual-stream representations and then reports direct measurements of contiguous-token edges in those trees. The central claims (strongest local organization in middle layers; smaller units detectable earlier in pre-training) are obtained by inspecting the resulting MST edge sets rather than by fitting any parameters to a subset of data and then claiming a prediction of the same quantities. No self-definitional loops, load-bearing self-citations, imported uniqueness theorems, or smuggled ansatzes appear in the derivation; the tree construction is a fixed algorithmic summary whose statistical properties are measured afterward. The analysis is therefore self-contained as an observational tool without reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Residual stream representations encode semantic relationships that can be meaningfully summarized by tree structures.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: StructLens constructs maximum spanning trees based on the semantic representations in residual streams... contiguous subtree ratio... islands
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We use reciprocal as a way to convert distance into similarity... g(h_i, h_j) = 1 / (1 + ||h_i - h_j||)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.