On the Predictive Power of Representation Dispersion in Language Models

Jiawei Zhou; Karen Livescu; Ming Li; Yanhong Li

arxiv: 2506.24106 · v2 · submitted 2025-06-30 · 💻 cs.CL · cs.AI

Yanhong Li, Ming Li, Karen Livescu, Jiawei Zhou

Pith reviewed 2026-05-19 07:02 UTC · model grok-4.3

keywords representation dispersion · language models · perplexity · contextual representations · model evaluation · kNN-LM · training objectives

The pith

Language models with more widely dispersed contextual representations tend to achieve lower perplexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a language model's text prediction performance is closely tied to how broadly it spreads its internal representations. It measures this spread as the average pairwise cosine distance among hidden vectors and finds a strong negative correlation with perplexity. The link holds across multiple model families and text domains. If correct, the finding supplies a label-free signal that can guide model selection, layer choice, and training adjustments.

Core claim

What carries the argument

Representation dispersion, defined as the average pairwise cosine distance among hidden vectors. It tracks the breadth of the embedding space and is reported to correlate strongly and negatively with perplexity.
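
A minimal sketch of the metric, assuming dispersion is the plain mean over all unordered pairs of hidden vectors. The abstract does not say how vectors are sampled or whether they are normalized first, so the function names and toy vectors below are illustrative and not taken from the paper's code.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def representation_dispersion(hidden_vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(hidden_vectors, 2))
    if not pairs:
        raise ValueError("need at least two hidden vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d "hidden states", made up for illustration only.
clustered = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
```

On these toy inputs the `spread` set scores higher than the `clustered` set, which is the direction the metric is meant to capture.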

If this is right

Dispersion computed on unlabeled text can rank examples by difficulty and surface hard slices in new domains without labeled data.
Layers with elevated dispersion can be chosen as the best inputs for kNN-LM, eliminating the need to search every layer exhaustively.
Adding a push-away training objective raises dispersion and improves perplexity in both single-domain and cross-domain settings.
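
The layer-selection idea above can be sketched as follows, assuming the rule is simply "take the layer with the highest dispersion" (the abstract does not spell out the exact rule). The sketch computes dispersion through the identity that, for unit vectors, the sum of all pairwise dot products equals the squared norm of their sum minus the number of vectors. `fast_dispersion`, `pick_layer`, and `toy_states` are hypothetical names and data.

```python
import math


def fast_dispersion(vectors):
    """Mean pairwise cosine distance via the sum-of-unit-vectors identity."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two vectors")
    dim = len(vectors[0])
    total = [0.0] * dim
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        for k in range(dim):
            total[k] += v[k] / norm  # accumulate the unit-normalized vector
    sq_norm_of_sum = sum(t * t for t in total)
    # Sum over ordered pairs i != j of cos(u_i, u_j) equals ||sum||^2 - n.
    mean_cosine = (sq_norm_of_sum - n) / (n * (n - 1))
    return 1.0 - mean_cosine


def pick_layer(states_by_layer):
    """Return (layer_index, dispersion) for the most dispersed layer."""
    scores = {layer: fast_dispersion(vecs) for layer, vecs in states_by_layer.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical per-layer hidden states for the same three tokens.
toy_states = {
    0: [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]],
    1: [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    2: [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
}
best_layer, best_score = pick_layer(toy_states)
```

The identity avoids the quadratic loop over pairs, which matters when a layer contributes many thousands of token vectors.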

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dispersion could serve as a cheap proxy for screening candidate models on new domains before running full perplexity evaluations.
Training routines that systematically increase dispersion might offer an alternative route to better generalization without changing model scale.

Load-bearing premise

That average pairwise cosine distance among hidden vectors captures a genuine, non-confounded aspect of predictive capability rather than being driven by model scale or training procedure.

What would settle it

A collection of models in which higher dispersion fails to produce lower perplexity once model size, training data volume, and architecture are matched.

Original abstract

We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion--the average pairwise cosine distance among hidden vectors--strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks--without requiring labeled data. First, measuring dispersion on unlabeled text allows us to rank examples by difficulty and identify hard slices in new domains, offering a data-efficient tool for screening and prioritizing models before full evaluation. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple "push-away" objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each. Code is available at https://github.com/yanhong-lbh/rep_dispersion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that a language model's ability to predict text is tightly linked to the breadth of its embedding space, with models that spread their contextual representations more widely (measured as average pairwise cosine distance among hidden vectors, termed representation dispersion) tending to achieve lower perplexity. This negative correlation is reported across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). The work further shows practical applications of dispersion without labeled data: ranking examples by difficulty to identify hard slices, pinpointing high-dispersion layers for retrieval methods like kNN-LM, and integrating a 'push-away' objective into training that increases dispersion and improves perplexity in single- and cross-domain settings. Code is provided for reproducibility.

Significance. If the central correlation is shown to be robust to confounds, the result could meaningfully advance understanding of how representation geometry relates to predictive performance in language models and supply simple, label-free tools for data screening, layer selection, and training. The public code release supports verification and extension.

major comments (2)

[Abstract] The reported strong negative correlation between representation dispersion and perplexity is presented without any indication of controls for model scale, parameter count, or training tokens. Because larger models commonly produce both wider representation spreads and lower perplexity, the correlation could be an artifact of scale rather than an independent effect of dispersion; recomputing the correlation within fixed-size cohorts or after regressing out these variables is required to support the claim that dispersion captures a distinct driver of predictive capability.
[Abstract] The 'push-away' objective is explicitly constructed to increase dispersion, so the observed perplexity gains may follow directly from the objective's design rather than furnishing independent evidence for the dispersion-perplexity relationship. Additional comparisons (e.g., to objectives that modulate dispersion differently or ablations that isolate the dispersion component) would be needed to strengthen this part of the argument.
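
The scale control requested in the first major comment can be sketched as a partial correlation: regress both dispersion and perplexity on a scale variable, then correlate the residuals. This is a minimal illustration, assuming a single linear control (for example log parameter count). The numbers are synthetic, centered toy values built so that scale fully explains the raw correlation; they are not measurements from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def residuals(ys, zs):
    """Residuals of ys after a least-squares fit on zs with an intercept."""
    n = len(ys)
    mz, my = sum(zs) / n, sum(ys) / n
    slope = sum((z - mz) * (y - my) for z, y in zip(zs, ys)) / sum(
        (z - mz) ** 2 for z in zs
    )
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after regressing zs out of both."""
    return pearson(residuals(xs, zs), residuals(ys, zs))


# Synthetic, centered toy values (hypothetical, for illustration only).
log_params = [1.0, 1.0, -1.0, -1.0]
dispersion = [3.0, 1.0, -1.0, -3.0]
perplexity = [-1.0, -3.0, 3.0, 1.0]

raw_r = pearson(dispersion, perplexity)
partial_r = partial_correlation(dispersion, perplexity, log_params)
```

In this constructed case the raw correlation is negative while the partial correlation is positive, which is the kind of reversal the check is designed to expose. A real analysis would also need to control for training tokens and architecture.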

minor comments (1)

[Abstract] The phrases 'diverse model families (LLaMA, Qwen, and others)' and 'domains (Wikipedia, news, scientific abstracts)' would be more informative if the exact counts or ranges of models and evaluation sets were stated, allowing readers to gauge the breadth of the reported correlation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and outline planned revisions to strengthen the claims.

Referee: [Abstract] The reported strong negative correlation between representation dispersion and perplexity is presented without any indication of controls for model scale, parameter count, or training tokens. Because larger models commonly produce both wider representation spreads and lower perplexity, the correlation could be an artifact of scale rather than an independent effect of dispersion; recomputing the correlation within fixed-size cohorts or after regressing out these variables is required to support the claim that dispersion captures a distinct driver of predictive capability.

Authors: We agree that model scale is a potential confound and that explicit controls are needed to isolate the contribution of representation dispersion. In the revised manuscript we will add new analyses that recompute the dispersion-perplexity correlation within fixed-size cohorts (e.g., all 7B models across families) and will report partial correlations after regressing out parameter count and training tokens. These additions will directly test whether dispersion retains predictive power beyond scale.
revision: yes

Referee: [Abstract] The 'push-away' objective is explicitly constructed to increase dispersion, so the observed perplexity gains may follow directly from the objective's design rather than furnishing independent evidence for the dispersion-perplexity relationship. Additional comparisons (e.g., to objectives that modulate dispersion differently or ablations that isolate the dispersion component) would be needed to strengthen this part of the argument.

Authors: We acknowledge that the push-away objective is deliberately designed to increase dispersion, so the resulting perplexity gains provide supportive but not fully independent evidence. To strengthen this section we will add ablations that compare against alternative objectives modulating dispersion differently and will include controls that isolate the dispersion term from other training effects. These experiments will be reported in the revised version.
revision: yes
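
To make the ablation discussion concrete, here is one hypothetical form a push-away term could take. The abstract does not give the objective, so the formula below is an assumption and may differ from the paper's actual loss. It adds the mean pairwise cosine similarity of the hidden vectors $h_1, \dots, h_n$ in a batch as a penalty, weighted by an assumed coefficient $\lambda$.

```latex
\mathcal{L}
  = \mathcal{L}_{\mathrm{LM}}
  + \lambda \cdot \frac{2}{n(n-1)} \sum_{i<j} \cos(h_i, h_j),
\qquad
D = \frac{2}{n(n-1)} \sum_{i<j} \bigl(1 - \cos(h_i, h_j)\bigr)
  = 1 - \frac{2}{n(n-1)} \sum_{i<j} \cos(h_i, h_j).
```

Under this assumed form, the penalty equals $1 - D$, so minimizing it is the same as maximizing dispersion $D$. An ablation that isolates the dispersion component would then vary $\lambda$ while holding everything else in training fixed.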

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports an empirical negative correlation between representation dispersion (average pairwise cosine distance) and perplexity across model families and domains, then demonstrates practical uses of dispersion and an intervention via a push-away objective that increases dispersion while improving perplexity. No equations, self-citations, fitted parameters renamed as predictions, or derivations reducing to inputs by construction appear in the provided abstract. The correlation is observational and the intervention tests the link rather than assuming it tautologically; the central claim remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides minimal detail on assumptions; the dispersion metric relies on standard vector similarity measures without explicit justification or controls for alternative explanations.

axioms (1)

domain assumption: Cosine distance is an appropriate measure for the breadth of contextual representations in language model hidden states.
Directly used to define representation dispersion in the abstract.

pith-pipeline@v0.9.0 · 5707 in / 1235 out tokens · 35086 ms · 2026-05-19T07:02:53.311463+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contrastive Regularization for Accent-Robust ASR
cs.SD · 2026-05 · unverdicted · novelty 4.0

Supervised contrastive learning as an auxiliary loss during CTC fine-tuning improves accent robustness in ASR, yielding up to 29% relative WER reduction on unseen accents.