Recognition: no theorem link
Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative
Pith reviewed 2026-05-11 00:59 UTC · model grok-4.3
The pith
Mean-pooled cosine similarity between neural representations grows with sequence length regardless of content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the anisotropy that characterizes transformer representations, mean-pooled cosine similarity between two sequences increases monotonically with their length ratio, independent of representational content. This length bias accounts for most of the observed cross-language similarity in code models and translation pairs, while switching to centered kernel alignment eliminates the bias and reverses the length coefficient.
What carries the argument
Mean-pooled cosine similarity, formed by averaging token embeddings before computing cosine, whose monotonic growth with length ratio is driven by representational anisotropy.
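To make the mechanism concrete, here is a minimal simulation sketch (not the paper's code): token embeddings are drawn around a shared mean direction, a simple stand-in for anisotropy, and mean-pooled cosine against a fixed reference sequence is tracked as the comparison sequence grows. The dimension, noise scale, and lengths are illustrative assumptions.

```python
import numpy as np

# Toy model of anisotropy: every token embedding shares a dominant mean
# direction mu, plus content-specific noise. Not the paper's setup.
rng = np.random.default_rng(0)
d = 256
mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)

def tokens(n, signal=2.0, noise=1.0):
    """n toy token embeddings: shared direction + content-specific noise."""
    return signal * mu + noise * rng.normal(size=(n, d))

def mean_pooled_cosine(A, B):
    a, b = A.mean(axis=0), B.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = tokens(16)  # fixed reference sequence
for n in (4, 16, 64, 256, 1024):
    sims = [mean_pooled_cosine(ref, tokens(n)) for _ in range(50)]
    print(f"len={n:5d}  mean-pooled cosine ~ {np.mean(sims):.3f}")
# Averaging more tokens shrinks the content noise around the shared mean,
# so the pooled vector drifts toward mu and the cosine rises with length
# regardless of content.
```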
If this is right
- Studies claiming cross-lingual or cross-modal representational convergence based on mean-pooled cosine require re-examination.
- Centered kernel alignment reduces the variance explained by length by more than 80 percent and should replace mean-pooled cosine for length-invariant comparisons (a minimal CKA sketch follows this list).
- In vision-language models, mean-pooling already reduces the length effect relative to last-token pooling.
- The length bias is domain-general and appears in code, text, and image encoders alike.
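As referenced in the CKA bullet above, here is a minimal linear-CKA sketch after Kornblith et al. (2019). Applying it to sequences of unequal length by comparing their d x d feature second-moment structure is an assumed protocol for illustration; the paper's exact procedure is not shown in this excerpt.

```python
import numpy as np

def center(K):
    """Double-center a Gram matrix (HSIC-style centering)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def linear_cka(A, B):
    """Linear CKA between representations A, B with paired rows (Kornblith et al., 2019)."""
    K, L = center(A @ A.T), center(B @ B.T)
    return np.sum(K * L) / np.sqrt(np.sum(K * K) * np.sum(L * L))

def cka_between_sequences(X, Y):
    """Assumed length-agnostic variant: pair the shared d feature dimensions
    instead of tokens, so sequences of different lengths stay comparable."""
    return linear_cka(X.T, Y.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 64))    # 20-token sequence, hidden size 64
Y = rng.normal(size=(200, 64))   # 200-token sequence, same hidden size
print(cka_between_sequences(X, Y))  # the 10x length ratio does not enter the statistic directly
```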
Where Pith is reading between the lines
- If future models are trained to produce more isotropic embeddings, the length dependence of mean-pooled cosine may weaken without changing the metric.
- Length-invariant metrics could improve reliability of similarity-based retrieval and clustering when sequence lengths vary widely.
- The same anisotropy mechanism may distort other averaged metrics such as Euclidean distance on pooled vectors.
Load-bearing premise
The anisotropy in transformer embeddings is strong enough and consistent enough to produce monotonic length dependence across the tested models and domains.
What would settle it
Finding no positive length coefficient for mean-pooled cosine in a transformer whose embeddings have been shown to be isotropic would falsify the central claim.
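One way such a test could be run, sketched below under assumptions not taken from the paper: isotropize token embeddings with all-but-the-top post-processing (Mu and Viswanath, 2018) and check whether the fitted length coefficient collapses toward zero. A slope that stays positive after isotropization would point to a mechanism other than anisotropy.

```python
import numpy as np

def isotropize(X, k=2):
    """All-but-the-top post-processing (Mu & Viswanath, 2018): subtract the mean
    embedding and project out the top-k principal directions. k=2 is an
    illustrative choice, not a value from the paper."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc - (Xc @ Vt[:k].T) @ Vt[:k]

def length_coefficient(pairs):
    """OLS slope of mean-pooled cosine on log length ratio over (X, Y) pairs of token matrices."""
    ratios, sims = [], []
    for X, Y in pairs:
        a, b = X.mean(axis=0), Y.mean(axis=0)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        ratios.append(np.log(len(X) / len(Y)))
    return np.polyfit(ratios, sims, deg=1)[0]

# Falsification probe (pairs = list of token-matrix pairs from a real model):
# beta_raw = length_coefficient(pairs)
# beta_iso = length_coefficient([(isotropize(X), isotropize(Y)) for X, Y in pairs])
```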
Original abstract
Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains $R^2 = 0.52$--$0.75$ of cross-language "Python proximity," while AST depth and shared-token fraction add less than 3% of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by 83% and reverses the sign of the length coefficient ($\beta_{\mathrm{len}}: +0.86 \to -0.37$). The same pattern holds in Mistral-7B on parallel WMT pairs ($R^2 = 0.23$ EN-FR, $R^2 = 0.33$ EN-DE for cosine; $R^2 < 0.01$ for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling ($R^2: 0.21 \to {<}0.01$), as predicted by the theory's dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that mean-pooled cosine similarity is not length-invariant: under the anisotropy typical of modern transformer representations, it grows monotonically with sequence length independent of representational content. It supports this with an algebraic derivation from the definition of mean-pooling and cosine, plus empirical results on HumanEvalPack (length explains R²=0.52–0.75 of cross-language proximity for four code LLMs, while AST depth and shared tokens add <3% variance), parallel WMT pairs (R²=0.23–0.33 for cosine vs <0.01 for CKA), and CLIP ViT-B/32 (mean-pooling reduces the length effect relative to EOS-pooling). The paper recommends CKA as the default length-invariant metric and calls for re-examination of prior cross-lingual convergence claims based on mean-pooled cosine.
Significance. If the central result holds, it is significant for representation learning and cross-domain analysis, as mean-pooled cosine is the default similarity metric in many studies of cross-lingual or cross-modal alignment. The algebraic insight tying monotonic length dependence directly to anisotropy, combined with consistent cross-domain R² patterns and sign reversal under CKA, provides a falsifiable prediction that could prompt re-evaluation of existing findings. The paper earns credit for the parameter-free derivation from standard anisotropy assumptions and for reproducible-style empirical reporting of exact R² and β_len values across models and domains.
major comments (2)
- [§3] §3 (Theory): the claim of monotonic growth 'independent of representational content' holds only when anisotropy exceeds a model-specific threshold (derived from the expectation over mean-pooled vectors); the manuscript does not report measured anisotropy diagnostics (leading eigenvalue fraction, average pairwise token cosine, or concentration parameter) for each model/domain, leaving open the possibility that the condition fails in some tested regimes and length dependence could vanish or reverse.
- [Table 2] Table 2 / HumanEvalPack results: while length alone yields R²=0.52–0.75, the absence of an explicit anisotropy check means the high R² could partly reflect domain-specific anisotropy variation rather than the universal mechanism asserted; a post-hoc diagnostic table would be required to confirm the threshold is crossed uniformly.
minor comments (2)
- [Abstract] Abstract and §4: the CLIP result is described as 'as predicted by the theory's dependence on anisotropy,' but the quantitative link between observed R² drop and measured anisotropy reduction is not shown; a short supplementary plot would clarify.
- [Methods] Notation: β_len is introduced in the abstract but its exact definition (regression coefficient in what model?) should be restated in the methods section for readers who skip the appendix.
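For reference, one plausible reading of the regression behind β_len and the reported R² values is sketched below; the exact covariates and the use of a raw versus log length ratio are assumptions not given in this excerpt.

```latex
\mathrm{sim}(x,y) \;=\; \beta_0
  \;+\; \beta_{\mathrm{len}} \,\log\frac{|x|}{|y|}
  \;+\; \beta_{\mathrm{AST}} \,\Delta\mathrm{depth}(x,y)
  \;+\; \beta_{\mathrm{tok}} \,\mathrm{shared}(x,y)
  \;+\; \varepsilon
```

Here sim is mean-pooled cosine or CKA, |x| and |y| are token counts, Δdepth is the AST-depth gap, and shared is the shared-token fraction; the "length alone" R² would then come from the model with only the β_len term.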
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments correctly identify that our theoretical claim requires explicit verification of the anisotropy threshold in the empirical regimes. We have revised the manuscript to include the requested diagnostics, which confirm the condition holds uniformly across all tested models and domains. This addition strengthens the paper without changing its core conclusions or results.
Point-by-point responses
- Referee: [§3] §3 (Theory): the claim of monotonic growth 'independent of representational content' holds only when anisotropy exceeds a model-specific threshold (derived from the expectation over mean-pooled vectors); the manuscript does not report measured anisotropy diagnostics (leading eigenvalue fraction, average pairwise token cosine, or concentration parameter) for each model/domain, leaving open the possibility that the condition fails in some tested regimes and length dependence could vanish or reverse.
Authors: We agree that the monotonic growth result is conditional on anisotropy exceeding a model-specific threshold, as derived from the expectation of the mean-pooled vectors. The original manuscript assumed this holds for modern transformers but did not report explicit diagnostics. In the revision, we now compute and report the leading eigenvalue fraction and average pairwise token cosine for every model and domain (HumanEvalPack languages, WMT pairs, and CLIP). All values exceed the relevant threshold (leading eigenvalue fraction >0.35 in every case), confirming the condition is satisfied uniformly. We have added these measurements to a new paragraph in §3 and a diagnostic table (new Table 3). This directly addresses the concern and supports the universality claim. revision: yes
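A minimal sketch of the two diagnostics named in this response, assuming that "leading eigenvalue fraction" means the top singular value's share of the squared spectrum of the token matrix and that the average pairwise cosine excludes self-pairs; the revision's exact estimators are not shown in this excerpt.

```python
import numpy as np

def anisotropy_diagnostics(X):
    """Anisotropy diagnostics for a token-embedding matrix X of shape (n_tokens, d)."""
    # Leading eigenvalue fraction: share of the total squared spectrum
    # carried by the dominant direction of X.
    s = np.linalg.svd(X, compute_uv=False)
    lead_frac = (s[0] ** 2) / np.sum(s ** 2)

    # Average pairwise cosine between token embeddings, excluding self-pairs
    # (in the spirit of Ethayarajh, 2019).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = Xn @ Xn.T
    n = X.shape[0]
    avg_cos = (C.sum() - np.trace(C)) / (n * (n - 1))
    return lead_frac, avg_cos
```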
- Referee: [Table 2] Table 2 / HumanEvalPack results: while length alone yields R²=0.52–0.75, the absence of an explicit anisotropy check means the high R² could partly reflect domain-specific anisotropy variation rather than the universal mechanism asserted; a post-hoc diagnostic table would be required to confirm the threshold is crossed uniformly.
Authors: The referee is right that, without anisotropy diagnostics, the high R² values could in principle be driven by varying anisotropy levels across languages rather than the length mechanism itself. We have added the post-hoc diagnostic table (new Table 3) showing that anisotropy exceeds the threshold in all HumanEvalPack settings, with minimal variation across languages. The length R² remains high and consistent even after controlling for these diagnostics, indicating the effect is not an artifact of domain-specific anisotropy differences. No alterations to the reported R² or regression coefficients are required. revision: yes
Circularity Check
No significant circularity: theory follows from algebraic mean-pooling under external anisotropy assumption; regressions are descriptive
full rationale
The paper derives monotonic length dependence of mean-pooled cosine directly from the definition of averaging vectors plus the standard, independently documented anisotropy of transformer token embeddings (not defined in terms of length effects). This is a first-principles algebraic result, not self-definitional or fitted. Empirical sections report ordinary least-squares regressions of observed similarities on length ratio (with R^2 values and beta coefficients), but these are post-hoc descriptive statistics on held-out data, not inputs that define or force the theoretical claim. No self-citations appear as load-bearing premises, no uniqueness theorems are invoked, and no ansatz is smuggled. The derivation chain remains self-contained against external measurements of anisotropy and does not reduce to its own outputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: transformer representations are anisotropic (token vectors cluster around a shared dominant direction).
Reference graph
Works this paper leans on
- [1] Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
- [2] Davari, M., Horoi, S., Natik, A., Lajoie, G., Wolf, G., and Belilovsky, E. Reliability of CKA as a similarity measure in deep learning. In International Conference on Learning Representations (ICLR), 2023.
- [3] Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 55–65, 2019.
- [4] Gao, J., He, D., Tan, X., Qin, T., Wang, L., and Liu, T.-Y. Representation degeneration problem in training natural language generation models. In International Conference on Learning Representations (ICLR), 2019.
- [5] Hui, B. et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
- [6] Kargaran, A. H. et al. From languages to atoms: Unifying low-resource language representations through logit lens. In Findings of the Association for Computational Linguistics (ACL), 2025.
- [7] Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
- [8] Mu, J. and Viswanath, P. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations (ICLR), 2018.
- [9]
- [10] Petrov, A., La Malfa, E., Torr, P. H. S., and Bibi, A. Language model tokenizers introduce unfairness between languages. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [11] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 8748–8763, 2021.
- [12] Robert, P. and Escoufier, Y. A unifying tool for linear multivariate statistical methods: The RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3): 257–265, 1976.
- [13] Rozière, B. et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- [14] Schut, L., Gal, Y., and Farquhar, S. Do multilingual large language models think in English? In International Conference on Learning Representations (ICLR), 2025.
- [15] Wendler, C., Veselovsky, V., Monea, G., and West, R. Do llamas work in English? On the latent language of multilingual transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- [16]