On the Predictive Power of Representation Dispersion in Language Models
Pith reviewed 2026-05-19 07:02 UTC · model grok-4.3
The pith
Language models with more widely dispersed contextual representations achieve lower perplexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Representation dispersion—the average pairwise cosine distance among hidden vectors—strongly and negatively correlates with perplexity across diverse model families and domains. Beyond the correlation, dispersion supports practical uses: ranking unlabeled examples by difficulty, selecting high-dispersion layers for retrieval methods such as kNN-LM, and adding a push-away objective to training that raises dispersion and lowers perplexity.
What carries the argument
Representation dispersion, defined as the average pairwise cosine distance among hidden vectors. It is taken as a measure of the breadth of the embedding space and is reported to correlate negatively with perplexity.
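A minimal sketch of the dispersion measure as defined above, assuming the hidden vectors are stacked as rows of a NumPy array. The function name `representation_dispersion`, the `eps` guard against zero-norm vectors, and the choice to average over all ordered pairs with i != j are illustrative assumptions; the paper's released code may sample pairs or pool tokens differently.

```python
import numpy as np


def representation_dispersion(hidden, eps=1e-12):
    """Average pairwise cosine distance among the rows of `hidden`.

    `hidden` has shape (n, d): n hidden vectors of dimension d, with n >= 2.
    """
    hidden = np.asarray(hidden, dtype=np.float64)
    n = hidden.shape[0]
    if n < 2:
        raise ValueError("need at least two vectors to form a pair")

    # Normalize each row to unit length (eps avoids division by zero).
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    unit = hidden / np.maximum(norms, eps)

    # Cosine similarity between every pair of rows.
    sim = unit @ unit.T

    # Average over the n * (n - 1) ordered pairs with i != j,
    # i.e. everything except the diagonal.
    mean_sim = (sim.sum() - np.trace(sim)) / (n * (n - 1))

    # Cosine distance is one minus cosine similarity.
    return float(1.0 - mean_sim)


# Mutually orthogonal vectors: every pair has cosine similarity 0.
print(representation_dispersion(np.eye(3)))  # 1.0
```

Values lie between 0 and 2: near 0 when all vectors point the same way, 1 when they are mutually orthogonal, and 2 for two exactly opposite vectors.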
If this is right
- Dispersion computed on unlabeled text can rank examples by difficulty and surface hard slices in new domains without labeled data.
- Layers with elevated dispersion can be chosen as the best inputs for kNN-LM, eliminating the need to search every layer exhaustively.
- Adding a push-away training objective raises dispersion and improves perplexity in both single-domain and cross-domain settings.
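One way to picture a push-away term, sketched under stated assumptions: the text here does not give the objective's exact form, so this version simply penalizes the mean pairwise cosine similarity of a batch of hidden vectors and adds it to the language-modeling loss. The names `push_away_penalty`, `combined_loss`, and `weight` are hypothetical, and NumPy is used only to show the arithmetic; actual training would need an autodiff framework.

```python
import numpy as np


def push_away_penalty(hidden, eps=1e-12):
    """Mean pairwise cosine similarity among the rows of `hidden` (assumed form)."""
    hidden = np.asarray(hidden, dtype=np.float64)
    n = hidden.shape[0]
    if n < 2:
        raise ValueError("need at least two vectors to form a pair")

    # Unit-normalize rows, then take all pairwise cosine similarities.
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    unit = hidden / np.maximum(norms, eps)
    sim = unit @ unit.T

    # Average over off-diagonal entries only (pairs with i != j).
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


def combined_loss(lm_loss, hidden, weight=0.1):
    """Language-modeling loss plus a weighted push-away term (hypothetical)."""
    return float(lm_loss + weight * push_away_penalty(hidden))
```

Under this assumed form the penalty equals one minus the average pairwise cosine distance, so minimizing it raises dispersion by construction.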
Where Pith is reading between the lines
- Dispersion could serve as a cheap proxy for screening candidate models on new domains before running full perplexity evaluations.
- Training routines that systematically increase dispersion might offer an alternative route to better generalization without changing model scale.
Load-bearing premise
That average pairwise cosine distance among hidden vectors captures a genuine, non-confounded aspect of predictive capability rather than being driven by model scale or training procedure.
What would settle it
A collection of models in which higher dispersion fails to produce lower perplexity once model size, training data volume, and architecture are matched.
Original abstract
We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion--the average pairwise cosine distance among hidden vectors--strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks--without requiring labeled data. First, measuring dispersion on unlabeled text allows us to rank examples by difficulty and identify hard slices in new domains, offering a data-efficient tool for screening and prioritizing models before full evaluation. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple "push-away" objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each. Code is available at https://github.com/yanhong-lbh/rep_dispersion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a language model's ability to predict text is tightly linked to the breadth of its embedding space, with models that spread their contextual representations more widely (measured as average pairwise cosine distance among hidden vectors, termed representation dispersion) tending to achieve lower perplexity. This negative correlation is reported across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). The work further shows practical applications of dispersion without labeled data: ranking examples by difficulty to identify hard slices, pinpointing high-dispersion layers for retrieval methods like kNN-LM, and integrating a 'push-away' objective into training that increases dispersion and improves perplexity in single- and cross-domain settings. Code is provided for reproducibility.
Significance. If the central correlation is shown to be robust to confounds, the result could meaningfully advance understanding of how representation geometry relates to predictive performance in language models and supply simple, label-free tools for data screening, layer selection, and training. The public code release supports verification and extension.
major comments (2)
- [Abstract] Abstract: the reported strong negative correlation between representation dispersion and perplexity is presented without any indication of controls for model scale, parameter count, or training tokens. Because larger models commonly produce both wider representation spreads and lower perplexity, the correlation could be an artifact of scale rather than an independent effect of dispersion; recomputing the correlation within fixed-size cohorts or after regressing out these variables is required to support the claim that dispersion captures a distinct driver of predictive capability.
- [Abstract] Abstract: the 'push-away' objective is explicitly constructed to increase dispersion, so the observed perplexity gains may follow directly from the objective's design rather than furnishing independent evidence for the dispersion-perplexity relationship. Additional comparisons (e.g., to objectives that modulate dispersion differently or ablations that isolate the dispersion component) would be needed to strengthen this part of the argument.
minor comments (1)
- [Abstract] Abstract: the phrase 'diverse model families (LLaMA, Qwen, and others)' and 'domains (Wikipedia, news, scientific abstracts)' would be more informative if the exact counts or ranges of models and evaluation sets were stated, allowing readers to gauge the breadth of the reported correlation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and outline planned revisions to strengthen the claims.
Point-by-point responses
- Referee: [Abstract] Abstract: the reported strong negative correlation between representation dispersion and perplexity is presented without any indication of controls for model scale, parameter count, or training tokens. Because larger models commonly produce both wider representation spreads and lower perplexity, the correlation could be an artifact of scale rather than an independent effect of dispersion; recomputing the correlation within fixed-size cohorts or after regressing out these variables is required to support the claim that dispersion captures a distinct driver of predictive capability.
  Authors: We agree that model scale is a potential confound and that explicit controls are needed to isolate the contribution of representation dispersion. In the revised manuscript we will add new analyses that recompute the dispersion-perplexity correlation within fixed-size cohorts (e.g., all 7B models across families) and will report partial correlations after regressing out parameter count and training tokens. These additions will directly test whether dispersion retains predictive power beyond scale.
  Revision: yes
- Referee: [Abstract] Abstract: the 'push-away' objective is explicitly constructed to increase dispersion, so the observed perplexity gains may follow directly from the objective's design rather than furnishing independent evidence for the dispersion-perplexity relationship. Additional comparisons (e.g., to objectives that modulate dispersion differently or ablations that isolate the dispersion component) would be needed to strengthen this part of the argument.
  Authors: We acknowledge that the push-away objective is deliberately designed to increase dispersion, so the resulting perplexity gains provide supportive but not fully independent evidence. To strengthen this section we will add ablations that compare against alternative objectives modulating dispersion differently and will include controls that isolate the dispersion term from other training effects. These experiments will be reported in the revised version.
  Revision: yes
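The scale control promised in the first response could be computed along these lines. This is a sketch, not the authors' analysis: `residualize` and `partial_correlation` are hypothetical helpers, the regression is ordinary least squares with an intercept, and the choice of covariates (for example log parameter count and log training tokens) is an assumption.

```python
import numpy as np


def residualize(values, covariates):
    """Residuals of `values` after an OLS fit on `covariates` plus an intercept."""
    values = np.asarray(values, dtype=np.float64)
    covariates = np.asarray(covariates, dtype=np.float64)

    # Design matrix: a column of ones followed by the covariate column(s).
    design = np.column_stack([np.ones(len(values)), covariates])

    # Least-squares coefficients; the residual is what the covariates
    # cannot explain linearly.
    beta, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ beta


def partial_correlation(x, y, covariates):
    """Pearson correlation of x and y after regressing `covariates` out of both."""
    rx = residualize(x, covariates)
    ry = residualize(y, covariates)
    return float(np.corrcoef(rx, ry)[0, 1])
```

Here `x` would hold one dispersion value per model, `y` the matching perplexities, and `covariates` an array with one row per model. If the raw correlation is strongly negative but the partial correlation is near zero, scale alone would account for the link; a partial correlation that stays clearly negative would support an effect beyond scale.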
Circularity Check
No significant circularity detected
Full rationale
The paper reports an empirical negative correlation between representation dispersion (average pairwise cosine distance) and perplexity across model families and domains, then demonstrates practical uses of dispersion and an intervention via a push-away objective that increases dispersion while improving perplexity. No equations, self-citations, fitted parameters renamed as predictions, or derivations reducing to inputs by construction appear in the provided abstract. The correlation is observational and the intervention tests the link rather than assuming it tautologically; the central claim remains independent of its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: cosine distance is an appropriate measure for the breadth of contextual representations in language model hidden states.
Forward citations
Cited by 1 Pith paper
- Contrastive Regularization for Accent-Robust ASR
  Supervised contrastive learning as an auxiliary loss during CTC fine-tuning improves accent robustness in ASR, yielding up to 29% relative WER reduction on unseen accents.