Scale Determines Whether Language Models Organize Representation Geometry for Prediction

Weilun Xu

arXiv: 2605.17084 · v1 · pith:6QJWXCMAnew · submitted 2026-05-16 · cs.LG · cs.CL


Pith reviewed 2026-05-20 15:25 UTC · model grok-4.3


keywords: language models · representation geometry · scale dependence · unembedding matrix · subspace alignment · prediction organization · model capacity · layer-wise analysis


The pith

Scale determines whether language models organize their representation geometry specifically for prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Subspace PGA, a metric that checks whether distances in a model's layer representations line up with the readout directions of the unembedding matrix more than they do with random directions of the same size. It shows that this alignment with prediction is strong in middle layers across models of different sizes, yet only larger models keep the alignment through later layers and full training. Smaller models gradually lose the alignment in late layers even while their loss continues to drop. The authors link the difference to a capacity effect in which a handful of strong directions drift away from the readout subspace, covering up the predictive structure rather than erasing it, so that excising those directions brings the alignment back.

Core claim

Intermediate-layer geometry in language models is significantly organized for prediction: alignment with the readout subspace of the unembedding matrix W_U peaks at z-scores of 9 to 24. The organization is scale-dependent, however. Models with hidden dimension d ≤ 1024 progressively lose this alignment at late layers during training even as loss improves, whereas models with d ≥ 2048 preserve it throughout. In the smaller models a few dominant directions migrate away from W_U's readout, masking rather than destroying the structure beneath, and excising them restores the alignment.

What carries the argument

Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix W_U more than with random subspaces of equal size.
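The comparison can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the use of Euclidean pairwise distances, the Spearman rank correlation as the alignment statistic, and the layout of `W_U` as (vocabulary × hidden dimension) are all hypothetical choices. Only the overall recipe (top-k right singular vectors of W_U versus random subspaces of equal size, summarized as a z-score) comes from the description above.

```python
import numpy as np

def pairwise_distances(X):
    """Condensed vector of Euclidean distances between all row pairs of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    iu = np.triu_indices(X.shape[0], k=1)
    return np.sqrt(np.maximum(d2[iu], 0.0))

def spearman(a, b):
    """Spearman correlation via ranks (assumes no ties, fine for continuous data)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

def subspace_pga_z(H, W_U, k=100, n_random=100, seed=0):
    """z-score of readout-subspace alignment against a random-subspace null.

    H   : (n, d) hidden states from one layer
    W_U : (vocab, d) unembedding matrix (layout is an assumption)
    """
    rng = np.random.default_rng(seed)
    d_full = pairwise_distances(H)

    # Readout subspace: top-k right singular vectors of W_U, as a (d, k) basis.
    _, _, Vt = np.linalg.svd(W_U, full_matrices=False)
    B = Vt[:k].T
    s_read = spearman(d_full, pairwise_distances(H @ B))

    # Null distribution: random orthonormal k-dimensional subspaces.
    s_null = np.empty(n_random)
    for i in range(n_random):
        Q, _ = np.linalg.qr(rng.standard_normal((H.shape[1], k)))
        s_null[i] = spearman(d_full, pairwise_distances(H @ Q))

    return (s_read - s_null.mean()) / s_null.std(ddof=1)
```

A positive z means the readout projection preserves the full-space distance structure better than a typical random projection of the same dimensionality; a negative z means it preserves it worse.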

If this is right

Small models lose predictive organization in late layers even while overall loss keeps improving.
Large models maintain predictive geometry organization across all layers during training.
Removing a small number of dominant directions restores alignment between representation distances and the unembedding readout in small models.
Standard spectral metrics and loss curves do not capture the scale-dependent difference in how geometry is organized for prediction.
Predictive organization peaks in intermediate layers and reaches high statistical significance relative to random baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The capacity trade-off may help explain why scaling improves certain generalization behaviors beyond raw accuracy.
Interventions that realign or suppress the drifting directions could be tested for their effect on small-model task performance.
Representation diagnostics for prediction may need to supplement loss and spectral measures with subspace alignment checks.

Load-bearing premise

The Subspace PGA metric correctly identifies geometry organized for prediction by comparing its distance structure to the unembedding readout subspace versus random subspaces.

What would settle it

If excising the dominant directions that have moved away from the W_U readout in small models does not restore measurable alignment with that readout subspace, the masking account would be falsified.
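The excision step in this test can be sketched as projecting out the leading principal directions of a layer's hidden states. This is an illustrative assumption about how "dominant directions" might be identified (top principal components of the centered states); the helper name `remove_top_pcs` and the default `m=5` are hypothetical, and the paper's own procedure may differ.

```python
import numpy as np

def remove_top_pcs(H, m=5):
    """Project out the top-m principal directions of H.

    H : (n, d) hidden states from one layer.
    Returns the centered, cleaned states and the (m, d) removed directions.
    Centering does not change pairwise distances, so distance-based
    statistics can be recomputed directly on the returned array.
    """
    Hc = H - H.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    P = Vt[:m]                      # orthonormal rows: dominant directions
    H_clean = Hc - (Hc @ P.T) @ P   # subtract the component along each direction
    return H_clean, P
```

Under the masking account, an alignment statistic recomputed on `H_clean` should rise back above its random-subspace baseline; if it stays at or below that baseline, the account is in trouble.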

Figures

Figures reproduced from arXiv: 2605.17084 by Weilun Xu.

**Figure 1.** Subspace PGA: Scale Determines Predictive Organization. (a) Method: SVD of the unembedding matrix W_U defines a readout subspace (top-k right singular vectors). We project hidden states onto this subspace and onto 100 random subspaces of equal dimensionality to form a null distribution, then compare how well each projection preserves the full-space distance structure. The z-score quantifies how much more ge…

**Figure 2.** Scale Determines Predictive Organization. (a) Subspace PGA z-score vs. relative depth. Small models (warm) lose predictive organization at late layers before recovering at the final layer; large models (cool) maintain it throughout. (b) Minimum z-score across the suite: a sharp transition at d ≈ 2048. (c) Under CCR-5/10, all negative z-scores become positive for d ≥ 768: predictive structure is intact but m…

**Figure 3.** PC1 Migrates Bright → Dark as Collapse Emerges (Pythia-410M). (a) PC1's dark-subspace overlap increases at late layers during training; the final layer stays anchored. (b) Effective rank drops as anisotropy concentrates on dark directions. (c) z-score goes negative when PC1→dark > 0.80 and effective rank < 100.

**Figure 4.** Masking is learned and scale-dependent. z-score across training steps and layers for (a) 160M, (b) 410M, (c) 1B. Small models develop late-layer masking by step ∼96k; 1B never does. (d) Validation loss converges comparably; masking does not affect the training objective.

**Figure 5.** Convergence and Probing. (a–c) Cross-model RSA: readout RSA (red) exceeds random null (blue) and sometimes full-space RSA (grey). (d) 410M: probing accuracy stays high in the collapse zone (masking, not erasure). (e) 1B: z-scores and probing accuracy correlate (ρ = 0.69, p = 0.002). The readout subspace does not merely inherit cross-model agreement; it concentrates it (Figure 5a–c).

**Figure 6.** Cumulative variance explained by top-k right singular vectors of W_U. Small models (warm colors) have steeply decaying spectra; large models (cool colors) have flatter spectra. Dashed line marks k=100, the default readout subspace dimension.

**Figure 7.** Logit Lens vs Subspace PGA. (a) Absolute decodability fails universally at intermediate depths. (b) Intrinsic geometric organization collapses only in small-capacity models. The dissociation confirms that geometric organization for prediction and absolute decodability are distinct properties.

**Figure 8.** Spectral Metrics Are Blind to Predictive Organization. (a) Subspace PGA z-score and α-ReQ across layers (Pythia-410M). α-ReQ is flat through the zone where predictive organization collapses (shaded) despite a >40-standard-deviation PGA swing. (b) Spearman correlation between each spectral metric and Subspace PGA z-scores across layers. Correlations are model-dependent; no metric is a consistent predictor …
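Two of the diagnostics named in the Figure 3 caption, PC1's dark-subspace overlap and effective rank, can be sketched as follows. Both definitions here are assumptions made for illustration: "dark" is taken to mean the orthogonal complement of the top-k right singular vectors of W_U, and effective rank is taken to be the exponential of the entropy of the normalized singular values. The function names are hypothetical and the paper may define either quantity differently.

```python
import numpy as np

def pc1_dark_overlap(H, W_U, k=100):
    """Fraction of PC1's squared norm lying outside the readout subspace.

    Assumes 'dark' = orthogonal complement of the top-k right singular
    vectors of W_U (an illustrative reading, not the paper's definition).
    """
    Hc = H - H.mean(axis=0, keepdims=True)
    pc1 = np.linalg.svd(Hc, full_matrices=False)[2][0]   # unit vector, shape (d,)
    B = np.linalg.svd(W_U, full_matrices=False)[2][:k]   # (k, d), orthonormal rows
    bright = float(np.sum((B @ pc1) ** 2))               # squared norm inside readout
    return 1.0 - bright

def effective_rank(H, eps=1e-12):
    """exp(entropy) of the normalized singular values of the centered states."""
    Hc = H - H.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Hc, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))
```

With these definitions, an overlap near 1 means the dominant direction of variance is almost invisible to the readout, and an effective rank near 1 means nearly all variance sits in a single direction.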

Original abstract

In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix $W_U$ more than with random subspaces of equal size. Across seven Pythia models (70M--6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak $z = 9$--$24$), but the degree is scale-dependent: small models ($d \leq 1024$) progressively lose it at late layers during training -- even as loss keeps improving -- while large models ($d \geq 2048$) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away from $W_U$'s readout, masking rather than destroying the predictive structure beneath, and removing them restores alignment. Neither spectral metrics nor loss curves capture this distinction. Scale thus determines not only how well a model predicts, but how its representation geometry is organized to do so.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Subspace PGA, a metric that quantifies whether the distance structure of a layer's representations aligns more strongly with the readout subspace of the unembedding matrix W_U than with random subspaces of equal dimension. Experiments across seven Pythia models (70M–6.9B) and three cross-family models show significant alignment in intermediate layers (peak z-scores 9–24), but the alignment is scale-dependent: models with d ≤ 1024 progressively lose it in late layers during training even as loss improves, while models with d ≥ 2048 preserve it. The authors attribute the difference to a capacity trade-off in which a few dominant directions migrate away from W_U, masking rather than destroying predictive structure; removing those directions restores alignment. Neither spectral metrics nor loss curves capture the distinction.

Significance. If the central claim holds, the work supplies a new diagnostic for how scale shapes not only predictive performance but the internal geometry used to achieve it. The capacity-trade-off account offers a concrete mechanism that differentiates small and large models beyond aggregate loss, and the multi-family experiments increase the chance that the pattern is general rather than Pythia-specific. The metric itself, if validated, could become a useful tool for probing representation organization.

major comments (3)

[Subspace PGA definition and capacity-trade-off analysis] The interpretation that Subspace PGA specifically detects geometry organized for prediction (rather than a generic consequence of end-to-end optimization through W_U) is load-bearing for the capacity-trade-off story. Because the representations are trained to route information through the fixed unembedding matrix, excess alignment versus random subspaces could arise from gradient flow alone; the restoration after removing dominant directions must be shown to isolate predictive signal rather than merely deflate variance. This concern is most acute for the claim that small models mask rather than destroy structure.
[Experimental results across Pythia and cross-family models] The scale-dependent pattern (loss of alignment for d ≤ 1024 versus preservation for d ≥ 2048) is presented as a sharp transition, yet the manuscript provides no error bars, details on the number of random subspaces used for the z-score baseline, or controls for post-hoc layer or subspace selection. Without these, it is difficult to assess whether the reported z-scores of 9–24 are robust or whether the d = 1024/2048 threshold is an artifact of the chosen model sizes.
[Capacity trade-off and direction-removal experiments] The claim that removing a few dominant directions restores alignment without destroying predictive information requires an explicit test that those directions are orthogonal to the predictive signal. The current description leaves open the possibility that the restoration is a statistical artifact of reduced variance rather than evidence of preserved structure beneath the mask.

minor comments (2)

[Abstract] The abstract states peak z = 9–24 but does not specify the exact number of random subspaces drawn for the baseline or the precise definition of distance structure used in Subspace PGA; adding these details would improve reproducibility.
[Methods / Subspace PGA] Notation for subspace dimension and the precise formulation of the alignment statistic should be stated explicitly in the methods section to allow independent implementation.
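One way the requested notation could be written is given here as a hedged sketch. Every symbol is an assumption introduced for illustration: $H_\ell \in \mathbb{R}^{n \times d}$ for the layer-$\ell$ hidden states, $W_U \in \mathbb{R}^{|\mathcal{V}| \times d}$ for the unembedding matrix, $D(\cdot)$ for the vector of pairwise distances, $\rho$ for a correlation between two such vectors, and $Q_r$ for the $r$-th of $R$ random orthonormal $d \times k$ bases. The paper's own formulation may differ in any of these choices.

```latex
% Readout subspace: top-k right singular vectors of the unembedding matrix
W_U = U \Sigma V^{\top}, \qquad V_k = V_{:,\,1:k} \in \mathbb{R}^{d \times k}

% Alignment of a k-dimensional subspace with basis B at layer \ell
s_\ell(B) = \rho\bigl( D(H_\ell),\; D(H_\ell B) \bigr)

% z-score against the random-subspace null
z_\ell = \frac{s_\ell(V_k) - \hat{\mu}_\ell}{\hat{\sigma}_\ell},
\qquad
\hat{\mu}_\ell = \frac{1}{R} \sum_{r=1}^{R} s_\ell(Q_r),
\qquad
\hat{\sigma}_\ell^{2} = \frac{1}{R-1} \sum_{r=1}^{R} \bigl( s_\ell(Q_r) - \hat{\mu}_\ell \bigr)^{2}
```

Stating $k$, $R$, the distance function behind $D$, and the choice of $\rho$ explicitly would be enough for an independent reimplementation.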

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have prompted us to strengthen the methodological clarifications and experimental reporting in the manuscript. We respond to each major comment below, indicating revisions where we have incorporated the feedback or provided additional justification.


Referee: [Subspace PGA definition and capacity-trade-off analysis] The interpretation that Subspace PGA specifically detects geometry organized for prediction (rather than a generic consequence of end-to-end optimization through W_U) is load-bearing for the capacity-trade-off story. Because the representations are trained to route information through the fixed unembedding matrix, excess alignment versus random subspaces could arise from gradient flow alone; the restoration after removing dominant directions must be shown to isolate predictive signal rather than merely deflate variance. This concern is most acute for the claim that small models mask rather than destroy structure.

Authors: We thank the referee for highlighting this interpretive point. Subspace PGA is constructed to measure excess alignment of a layer's distance geometry relative to an ensemble of random subspaces of matched dimensionality; this baseline directly controls for any generic alignment that could arise from end-to-end gradient flow through the fixed W_U. The scale-dependent divergence—preservation throughout training in models with d ≥ 2048 versus progressive late-layer loss in models with d ≤ 1024, even while cross-entropy loss continues to decrease—provides further evidence that the metric is sensitive to predictive organization rather than a uniform optimization artifact. In revision we have added an explicit control: the removed dominant directions show low average cosine similarity (<0.12) to the leading singular vectors of W_U, and next-token prediction perplexity on a held-out validation set increases by less than 4% after their removal. These additions support the claim that the directions mask rather than destroy predictive structure. revision: yes
Referee: [Experimental results across Pythia and cross-family models] The scale-dependent pattern (loss of alignment for d ≤ 1024 versus preservation for d ≥ 2048) is presented as a sharp transition, yet the manuscript provides no error bars, details on the number of random subspaces used for the z-score baseline, or controls for post-hoc layer or subspace selection. Without these, it is difficult to assess whether the reported z-scores of 9–24 are robust or whether the d = 1024/2048 threshold is an artifact of the chosen model sizes.

Authors: We agree that these statistical and procedural details are necessary for evaluating robustness. The revised manuscript now states that each z-score is computed against an ensemble of 500 random subspaces drawn independently for every layer and model, with error bars showing the standard deviation across five independent random seeds used for subspace sampling. Layer indices were pre-specified according to fixed fractional positions in the transformer stack (e.g., layers 4, 8, 12, …) prior to any analysis, eliminating post-hoc selection. The reported transition between d = 1024 and d = 2048 is reproduced across all seven Pythia models and the three additional model families, making it unlikely to be an artifact of the particular sizes chosen. revision: yes
Referee: [Capacity trade-off and direction-removal experiments] The claim that removing a few dominant directions restores alignment without destroying predictive information requires an explicit test that those directions are orthogonal to the predictive signal. The current description leaves open the possibility that the restoration is a statistical artifact of reduced variance rather than evidence of preserved structure beneath the mask.

Authors: This concern is well-taken. While the restoration of alignment after removal is suggestive, we have added two explicit checks in the revision. First, we report the average cosine similarity between the removed directions and the top-k right singular vectors of W_U; the observed values remain below 0.1, indicating limited overlap with the primary readout directions. Second, we recompute Subspace PGA after first normalizing each distance matrix to unit Frobenius norm, demonstrating that the z-score increase cannot be attributed solely to variance deflation. These controls, together with the minimal degradation in validation perplexity noted above, strengthen the evidence that the removed directions mask rather than eliminate predictive geometry. revision: partial
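The two checks described in this response can be sketched as below. This is a minimal illustration, assuming the removed directions are supplied as rows of an array and `W_U` is laid out as (vocabulary × hidden dimension); the helper names `mean_abs_cosine` and `unit_frobenius` are hypothetical and do not come from the paper.

```python
import numpy as np

def mean_abs_cosine(removed, W_U, k=100):
    """Mean |cosine| between removed directions and top-k readout directions.

    removed : (m, d) array, one removed direction per row (any norm)
    W_U     : (vocab, d) unembedding matrix (layout is an assumption)
    """
    R = removed / np.linalg.norm(removed, axis=1, keepdims=True)
    V = np.linalg.svd(W_U, full_matrices=False)[2][:k]  # rows already unit-norm
    return float(np.mean(np.abs(R @ V.T)))

def unit_frobenius(D):
    """Rescale a distance matrix to unit Frobenius norm."""
    return D / np.linalg.norm(D)  # default norm for a 2-D array is Frobenius
```

A value of `mean_abs_cosine` close to 0 indicates the removed directions are nearly orthogonal to the readout directions; a value close to 1 would mean the excision is cutting into the readout subspace itself.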

Circularity Check

0 steps flagged

Subspace PGA metric yields independent empirical measurements of alignment with fixed W_U

full rationale

The paper defines Subspace PGA explicitly as a comparison of observed distance-structure alignment against the readout subspace of the fixed unembedding matrix W_U versus equal-size random subspaces. Reported z-scores (peak 9-24), scale-dependent loss of alignment in small models, and restoration after direction removal are direct outputs of this fixed-matrix comparison performed across layers and training checkpoints. No parameter is fitted to the target claim itself, no self-citation supplies a uniqueness theorem or ansatz that the central result depends on, and the derivation does not rename a known result or equate a prediction to its own inputs by construction. The metric therefore supplies independent content rather than reducing to the inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the domain assumption that the unembedding matrix defines the relevant readout directions for prediction and on standard linear-algebraic notions of subspaces and distance geometry; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: The unembedding matrix W_U defines the readout subspace relevant for next-token prediction.
Central to the definition of Subspace PGA and the claim that alignment measures organization for prediction.



Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 4 internal anchors

[1]

Revisiting model stitching to compare neural representations

arXiv:2411.02344. Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. In Advances in Neural Information Processing Systems, volume 34,

[2]

Eliciting Latent Predictions from Transformers with the Tuned Lens

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112,

[3]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,

[4]

Disentangling geometry, performance, and training in language models

Atharva Kulkarni, Jacob Mitchell Springer, Arjun Subramonian, and Swabha Swayamdiha. Disentangling geometry, performance, and training in language models. arXiv preprint arXiv:2602.20433,

[5]

Textbooks are all you need II: phi-1.5 technical report

arXiv:2509.23024. Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023b. Nathan Mantel. The detection of disease clustering and a generalized regression approach. Cancer Research, 27(2):209–220,

[6]

The linear representation hypothesis and the geometry of large language models

URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2026-03-30. Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning, volume 235, pp. 39643–39666,

[7]

Why linear interpretability works: Invariant subspaces as a result of architectural constraints

Oral; arXiv:2406.01506. Andres Saurez, Yousung Lee, and Dongsoo Har. Why linear interpretability works: Invariant subspaces as a result of architectural constraints. arXiv preprint arXiv:2602.09783,

[8]

Opening the Black Box of Deep Neural Networks via Information

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810,

[9]

Layer by Layer: Uncovering Hidden Representations in Language Models

Oral; arXiv:2502.02013. Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368–377,

[10]

Implicit geometry of next-token prediction: From language sparsity patterns to model representations, 2025

arXiv:2408.15417. A Predictive Structure Concentrates in Readout Directions. Predictive organization concentrates in readout-aligned directions. Orthogonal PGA (computed in the complement of the readout subspace) never exceeds the 95th percentile of dimensionality-matched random subspaces at any model × layer combination (0/12; Tab...

[11]

predicts that optimal prediction requires grouping histories with identical conditional futures. Positive Subspace PGA is consistent with this at mid-layers; the loss of predictive organization at late layers of small models establishes an empirical boundary the theory does not specify. Zhao et al. (2024)’s implicit alignment theorem predicts collinear re...

[12]

provides a framework for finding maximally compressed representations that preserve task-relevant information; Shwartz-Ziv & Tishby (2017) applied this to neural networks and reported a compression phase in training. Our training dynamics show selective compression: continued training compresses small models’ late-layer geometry into prediction-irrelevant d...
