
arXiv: 2505.17101 · v6 · pith:C53B7FT5new · submitted 2025-05-21 · 💻 cs.CL · cs.LG · physics.comp-ph

A quantitative analysis of semantic information in deep representations of text and images

Pith reviewed 2026-05-22 14:33 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · physics.comp-ph
keywords semantic convergence · information imbalance · deep representations · cross-modal predictability · layer-wise analysis · language models · vision models · asymmetric predictability

The pith

Deep representations of text and images align on shared semantic information across languages, modalities, and model architectures, with directed predictability peaking in middle layers and showing asymmetries by language and scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how well one model's internal activations can predict another's for the same or related inputs, using a rank-based proxy for cross-entropy that works in high dimensions. Applied to translations processed by DeepSeek-V3 across six language pairs, semantic content appears spread over many tokens and most predictable in central layers. The same layers also show the strongest links to visual representations of image captions, though English activations predict others more effectively and larger models predict smaller ones more than the reverse. In vision models the concentration shifts depending on whether the architecture is autoregressive or an encoder. These patterns are presented as evidence that models converge on similar semantic structures while the direction and strength of prediction still depend on depth, size, and language.

Core claim

Measurements of Information Imbalance between representations show that semantic information is distributed across many tokens and reaches peak predictability in a set of central layers for language models, in middle layers for autoregressive vision models, and in final layers for encoder vision models; those same layers produce the strongest cross-modal links to textual caption representations, with English representations more predictive than others and larger-model representations more predictive of smaller-model ones.

What carries the argument

Information Imbalance, the asymmetric rank-based measure that quantifies how well one high-dimensional representation can predict another as a proxy for cross-entropy.
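
As a rough illustration of what a rank-based, asymmetric measure of this kind computes, the sketch below uses the nearest-neighbor form commonly given for Information Imbalance: Δ(A→B) is 2/N times the average rank, measured in space B, of each point's nearest neighbor in space A. The function names, the Euclidean distance, and the use of a single nearest neighbor are assumptions made for illustration; the paper's exact estimator may differ.

```python
import numpy as np


def neighbor_ranks(X):
    """Return (order, ranks) for the rows of X under Euclidean distance.

    order[i, 0] is the nearest neighbor of point i; ranks[i, j] is the
    rank of point j as seen from point i (1 = nearest). Each point is
    ranked last with respect to itself.
    """
    sq = np.sum(X * X, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    order = np.argsort(dist, axis=1, kind="stable")
    n = X.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    return order, ranks


def information_imbalance(A, B):
    """Delta(A -> B): 2/N times the mean rank in B of each nearest neighbor in A."""
    n = A.shape[0]
    order_A, _ = neighbor_ranks(A)
    _, ranks_B = neighbor_ranks(B)
    nearest_in_A = order_A[:, 0]
    return 2.0 / n * ranks_B[np.arange(n), nearest_in_A].mean()


# Toy check on synthetic data (not data from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
Y_related = X + 0.05 * rng.normal(size=X.shape)  # nearly a copy of X
Y_unrelated = rng.normal(size=X.shape)           # independent of X

ii_related = information_imbalance(X, Y_related)
ii_unrelated = information_imbalance(X, Y_unrelated)
```

Under this form, values near 0 mean that neighborhoods in A predict neighborhoods in B, and values near 1 mean that A is uninformative about B. Swapping the arguments gives the opposite direction, which is what makes the measure asymmetric.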

If this is right

  • Semantic information spreads across many tokens rather than concentrating in a few.
  • Predictability between representations is strongest in central layers for text models and varies by layer type in vision models.
  • English representations are systematically more predictive of other languages than the reverse.
  • Larger models predict smaller-model representations more effectively than the smaller models predict the larger ones.
  • The layers holding the most semantic content within each modality also yield the strongest cross-modal predictability.
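
The first point above depends on how per-token activations are turned into one vector per sentence. The figure captions mention three choices: the last token, the average of the last 20 tokens, and the concatenation of the last 20 tokens. A minimal sketch of those three poolings follows, assuming activations are stored as an array of shape (sentences, tokens, hidden size) with every sentence at least 20 tokens long; the array layout and function names are hypothetical.

```python
import numpy as np


def last_token(acts):
    """Use only the final token's activation: (N, T, D) -> (N, D)."""
    return acts[:, -1, :]


def mean_last_k(acts, k=20):
    """Average the last k token activations: (N, T, D) -> (N, D)."""
    if acts.shape[1] < k:
        raise ValueError("each sentence needs at least k tokens")
    return acts[:, -k:, :].mean(axis=1)


def concat_last_k(acts, k=20):
    """Concatenate the last k token activations: (N, T, D) -> (N, k * D)."""
    n, t, d = acts.shape
    if t < k:
        raise ValueError("each sentence needs at least k tokens")
    return acts[:, -k:, :].reshape(n, k * d)


# Random stand-in activations: 4 sentences, 30 tokens, hidden size 8.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 30, 8))

pooled_last = last_token(acts)
pooled_mean = mean_last_k(acts)
pooled_concat = concat_last_k(acts)
```

Concatenation keeps token order and multiplies the dimension by k, while averaging keeps the dimension fixed but discards order.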

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed convergence may point to an underlying shared semantic geometry that different training regimes approximate.
  • Layer-specific predictability patterns could be used to select which activations to align when building multimodal systems.
  • The English-centric asymmetry raises the question of how much the convergence depends on training-data language balance.
  • Similar analyses on models trained from scratch on balanced multilingual data could test whether the asymmetries persist.

Load-bearing premise

The Information Imbalance metric faithfully captures semantic predictability without substantial distortion from the high-dimensional rank approximation or from the particular models and inputs chosen.

What would settle it

A new experiment applying the same Information Imbalance analysis to models trained on entirely non-overlapping data distributions or to a fresh modality such as audio would show no layer-wise concentration or cross-modal alignment if the central claim is incorrect.

Figures

Figures reproduced from arXiv: 2505.17101 by Alessandro Laio, Andrea Mascaretti, Marco Baroni, Matéo Mahaut, Riccardo Rende, Santiago Acevedo.

Figure 1: a) Information Imbalance (II) Δ(X → Y) and Δ(Y → X), CKA, and Neighborhood Overlap (NO) for a synthetic Gaussian construction in which each index r generates a pair (X_r, Y_r) via Y_r = B_r X_r + ε, with X_r ∼ N(0, I), ε ∼ N(0, σ² I), in p = 10 dimensions. The matrices B_r ∈ ℝ^(p×p) have monotonically increasing rank, from one at r = 1 to full rank at the final index. Note that smaller II means more predictivity, wher…
Figure 2: Representation choice. Information Imbalance from English to Italian translations in DeepSeek-V3 when tokens are a) averaged or b) concatenated. The (hardly visible) shaded colored areas correspond to the standard deviation obtained by subsampling half of the samples five times. […] predictability seems driven by semantic correspondences, this result suggests that semantic information is spread across many tok…
Figure 3: Comparison with other languages and models. Panel a): Information Imbalance from English to several languages, computed on representations generated by DeepSeek-V3. Panel b): Information Imbalance from English to Italian, computed on representations generated by Llama3 models with 1, 3, and 8 billion parameters, and by DeepSeek-V3 for comparison. In both panels we used the average of the last 20 tokens to r…
Figure 4: Information asymmetries. Panel a): Information Imbalance (II) from English to Italian and from Italian to English, computed on representations generated by DeepSeek-V3. Panel b): II Asymmetry A = II(English → other) − II(other → English) between English and other languages, computed on representations generated by DeepSeek-V3. Note that, under this definition of asymmetry, a negative value implies that Eng…
Figure 5: Token-token Information Imbalance. Information Imbalance from the last token to a previous token at distance τ, as a function of τ. Panels a) and b) correspond to DeepSeek-V3 representations of English and Italian text, whereas panels c) and d) correspond to Llama3-8b representations of English and Italian. The (hardly visible) shaded area corresponds to one standard deviation, computed with a Jackknife …
Figure 6: a) Within-model Information Imbalance for 1,000 same-class image pairs from ImageNet-1k, using mean-token activations. For each model, we compare the representations of two distinct images of the same class at the same layer, and plot the result as a function of relative depth. DinoV2 reaches its minimum at the last layer; ImageGPT at ≈ 42% relative depth. b) Cross-modal Information Imbalance on Flickr30k …
Figure 7: Information Imbalance between English (en) and Spanish (es) representations generated by …
Figure 8: II Asymmetry A = II(English → other) − II(other → English) between English and other languages, computed on representations generated by Llama3-8b. Note that, under this definition of asymmetry, a negative value implies that English is more informative than the other language. […] First, we highlight that, for all three representational choices, the three metrics concurrently achieve their maximum score (lowe…
Figure 9: Centered Kernel Alignment (CKA) and Neighborhood Overlap (NO) comparison on translations. Information Imbalance (II) from English to Italian and from Italian to English, together with CKA and Neighborhood Overlap computed on the same activations, using a) the last token, b) the average of the last 20 tokens, and c) the concatenation of the last 20 tokens. Note that for II, lower values cue higher alignment,…
Figure 10: a) Cross-modal Information Imbalance on Flickr30k image–caption pairs with Llama3.1-8B (≈ 59% relative depth) as the text encoder, swept against DinoV2-large and image-gpt-large. b) DinoV2 model-size comparison: DeepSeek-V3 (≈ 60% relative depth) swept against DinoV2-large, DinoV2-base, and DinoV2-small. c) Information Imbalance from DinoV2 (last layer) and image-gpt-large (≈ 42% relative depth) against D…
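
The synthetic construction described in the Figure 1 caption can be sketched as follows. The caption fixes Y_r = B_r X_r + ε with Gaussian X_r and ε in p = 10 dimensions and B_r of increasing rank; it does not say how B_r is built, what σ is, or how many samples are drawn. The diagonal projector used for B_r, σ = 0.1, and n = 400 are therefore assumptions, as is the nearest-neighbor form of the Information Imbalance used here.

```python
import numpy as np


def _ranks(X):
    """Neighbor order and rank matrix under Euclidean distance (self ranked last)."""
    sq = np.sum(X * X, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(len(X))[:, None], order] = np.arange(1, len(X) + 1)
    return order, ranks


def information_imbalance(A, B):
    """Delta(A -> B) in its single-nearest-neighbor form (assumed here)."""
    n = len(A)
    nearest_in_A = _ranks(A)[0][:, 0]
    return 2.0 / n * _ranks(B)[1][np.arange(n), nearest_in_A].mean()


def make_pair(rank, n=400, p=10, sigma=0.1, rng=None):
    """Draw X ~ N(0, I) and Y = B X + eps, with a hypothetical rank-`rank` B."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.normal(size=(n, p))
    # Assumed B_r: projector onto the first `rank` coordinates.
    B = np.diag((np.arange(p) < rank).astype(float))
    Y = X @ B.T + sigma * rng.normal(size=(n, p))
    return X, Y


rng = np.random.default_rng(0)
results = []  # tuples of (rank, Delta(X -> Y), Delta(Y -> X))
for rank in range(1, 11):
    X, Y = make_pair(rank, rng=rng)
    results.append((rank, information_imbalance(X, Y), information_imbalance(Y, X)))
```

With this choice of B_r, Y retains only the first r coordinates of X plus noise, so Y should become more informative about X as the rank grows.
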
Original abstract

It was recently observed that the representations of different models that process identical or semantically related inputs tend to align. We analyze this phenomenon using the Information Imbalance, an asymmetric rank-based measure that quantifies the capability of a representation to predict another, providing a proxy of the cross-entropy which can be computed efficiently in high-dimensional spaces. By measuring the Information Imbalance between representations generated by DeepSeek-V3 processing translations, we find that semantic information is spread across many tokens, and that semantic predictability is strongest in a set of central layers of the network, robust across six language pairs. We measure clear information asymmetries: English representations are systematically more predictive than those of other languages, and DeepSeek-V3 representations are more predictive of those in a smaller model such as Llama3-8b than the opposite. In the visual domain, we observe that semantic information concentrates in middle layers for autoregressive models and in final layers for encoder models, and these same layers yield the strongest cross-modal predictability with textual representations of image captions. Our results support the hypothesis of semantic convergence across languages, modalities, and architectures, while showing that directed predictability between representations varies strongly with layer-depth, model scale, and language.
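
The asymmetry reported in the abstract is defined in the Figure 4 caption as A = II(English → other) − II(other → English), where a negative value means the English representation is the more informative one. The sketch below applies that definition to per-layer values and locates the layer with the lowest imbalance. The per-layer numbers are invented purely for illustration, and the relative-depth convention (layer index divided by the last layer index) is an assumption.

```python
import numpy as np


def asymmetry(ii_en_to_other, ii_other_to_en):
    """A = II(English -> other) - II(other -> English), elementwise per layer."""
    return np.asarray(ii_en_to_other, dtype=float) - np.asarray(ii_other_to_en, dtype=float)


def most_predictive_layer(ii_per_layer):
    """Return (layer index, relative depth) of the minimum Information Imbalance."""
    ii = np.asarray(ii_per_layer, dtype=float)
    layer = int(np.argmin(ii))
    relative_depth = layer / (len(ii) - 1) if len(ii) > 1 else 0.0
    return layer, relative_depth


# Made-up per-layer values for illustration only (not results from the paper).
ii_en_to_it = [0.90, 0.50, 0.30, 0.60]
ii_it_to_en = [0.95, 0.60, 0.45, 0.70]

A = asymmetry(ii_en_to_it, ii_it_to_en)
layer, depth = most_predictive_layer(ii_en_to_it)
```

Because lower Information Imbalance means better prediction, the sign convention can be easy to misread: A < 0 says that English predicts the other language better than the reverse.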

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes semantic convergence in deep representations using the Information Imbalance, an asymmetric rank-based proxy for cross-entropy between model outputs. On text, it reports that semantic information is distributed across tokens in DeepSeek-V3, with strongest directed predictability in central layers across six language pairs; English representations are more predictive than those of other languages, and DeepSeek-V3 representations predict Llama3-8b outputs better than the reverse. On images, semantic information concentrates in middle layers for autoregressive models and final layers for encoders, with peak cross-modal predictability to caption text occurring in the same layers. The central claim is that these patterns support semantic convergence across languages, modalities, and architectures while showing that directed predictability varies systematically with layer depth, model scale, and language.

Significance. If the metric faithfully captures semantic predictability, the work supplies quantitative evidence for cross-lingual and cross-modal alignment in neural representations and identifies layer-specific loci of semantic content. The efficient, parameter-free nature of the rank-based measure in high dimensions is a methodological strength that enables the reported comparisons without additional fitting. The patterns are falsifiable through replication on other models or inputs.

major comments (1)
  1. [Methods (Information Imbalance definition and application)] The load-bearing assumption that Information Imbalance provides an undistorted asymmetric proxy for semantic predictability and cross-entropy must be validated against high-dimensional rank-estimation artifacts (nearest-neighbor sensitivity to local density, embedding norm, or k). The abstract and methods description give no sign of such controls or comparisons to direct entropy estimates; without them the reported asymmetries (English > other languages; DeepSeek-V3 > Llama3-8b) and layer-wise peaks cannot be confidently attributed to semantic content rather than metric bias. This directly affects the central claims.
minor comments (2)
  1. [Experimental setup] Clarify the exact token sampling and exclusion rules used for the multilingual translation experiments; the abstract mentions 'many tokens' but does not specify how inputs were prepared or whether length normalization was applied.
  2. [Visual domain experiments] Add a brief statement on the number of image-caption pairs and the precise visual models employed; this would help readers assess the scope of the cross-modal results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the methodological strengths of the Information Imbalance approach. We address the major comment on metric validation below and will incorporate additional controls in the revision.

Point-by-point responses
  1. Referee: The load-bearing assumption that Information Imbalance provides an undistorted asymmetric proxy for semantic predictability and cross-entropy must be validated against high-dimensional rank-estimation artifacts (nearest-neighbor sensitivity to local density, embedding norm, or k). The abstract and methods description give no sign of such controls or comparisons to direct entropy estimates; without them the reported asymmetries (English > other languages; DeepSeek-V3 > Llama3-8b) and layer-wise peaks cannot be confidently attributed to semantic content rather than metric bias. This directly affects the central claims.

    Authors: We agree that explicit checks for rank-estimation artifacts would increase confidence in attributing the observed asymmetries and layer-wise patterns to semantic content. The current manuscript does not report dedicated sensitivity analyses or direct entropy comparisons, relying instead on the parameter-free nature of the rank-based proxy and the consistency of results across six language pairs, multiple models, and both text and image modalities. In the revised version we will add: (i) robustness tests varying the neighbor parameter k across a range of values, (ii) results after L2-normalizing all embeddings to control for norm effects, and (iii) mutual-information estimates on PCA-reduced representations for a representative subset of layers as a proxy comparison to direct entropy. These additions should help confirm that the reported English > other-language and DeepSeek-V3 > Llama3-8b directed predictabilities, as well as the central-layer peaks, are not driven by local-density or norm biases.

    Revision: yes
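
Checks (i) and (ii) from the simulated rebuttal could look roughly like the sketch below. It assumes one plausible k-neighbor generalization of the Information Imbalance, namely 2/N times the mean rank in B of each point's k nearest neighbors in A, which reduces to the single-neighbor form at k = 1. That generalization, the function names, and the chosen k values are illustrative assumptions, not the authors' procedure. Check (iii), the mutual-information comparison, is not covered.

```python
import numpy as np


def l2_normalize(X, eps=1e-12):
    """Scale every row to unit Euclidean norm to remove embedding-norm effects."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


def _ranks(X):
    """Neighbor order and rank matrix under Euclidean distance (self ranked last)."""
    sq = np.sum(X * X, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(len(X))[:, None], order] = np.arange(1, len(X) + 1)
    return order, ranks


def information_imbalance_k(A, B, k=1):
    """Assumed k-neighbor variant: 2/N times the mean B-rank of the k nearest A-neighbors."""
    n = len(A)
    neighbors = _ranks(A)[0][:, :k]                       # shape (N, k)
    cond_ranks = np.take_along_axis(_ranks(B)[1], neighbors, axis=1)
    return 2.0 / n * cond_ranks.mean()


def sensitivity_sweep(A, B, ks=(1, 5, 10), normalize=False):
    """Recompute the imbalance for several k, optionally after L2 normalization."""
    if normalize:
        A, B = l2_normalize(A), l2_normalize(B)
    return {k: information_imbalance_k(A, B, k) for k in ks}


# Synthetic stand-in data (not activations from any model).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
Y = X + 0.1 * rng.normal(size=X.shape)

sweep_raw = sensitivity_sweep(X, Y)
sweep_norm = sensitivity_sweep(X, Y, normalize=True)
```

If an asymmetry or a layer-wise peak keeps its sign and location across the sweep, that would count as evidence against a pure artifact of k or of embedding norm.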

Circularity Check

0 steps flagged

Minor self-citation for Information Imbalance metric but central empirical claims remain independent

full rationale

The paper introduces the Information Imbalance as an asymmetric rank-based proxy for cross-entropy between high-dimensional representations and applies it to measure predictability across layers, models (DeepSeek-V3, Llama3-8b), languages, and modalities. The reported patterns of semantic convergence and directed asymmetries are direct empirical outputs of this metric applied to fixed model activations on translations and image captions. No equation or result reduces to a fitted parameter or input definition by construction, and any self-citation for the metric definition is not load-bearing for the convergence hypothesis. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the assumption that Information Imbalance accurately proxies semantic cross-entropy and that the chosen models and translations are representative of semantic processing.

axioms (1)
  • domain assumption: Information Imbalance provides an efficient proxy for cross-entropy between high-dimensional representations.
    Stated in the abstract as the justification for using the measure to quantify semantic predictability.

pith-pipeline@v0.9.0 · 5762 in / 1325 out tokens · 44538 ms · 2026-05-22T14:33:24.443343+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

    cs.AI 2026-05 unverdicted novelty 7.0

    Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

Appendix A. Misalignment of translations erases semantic similarity

As a consistency check, Fig. 7 shows the Information Imbalance for DeepSeek-V3 and Llama3.1-8b representations using misaligned translations, namely performing a batch-shuffle in one of the datasets. Since the semantic correspondence between sentences is destroyed, the representations are not info...
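
The batch-shuffle control described above can be sketched on synthetic data: permuting the rows of one dataset breaks the pairing between samples, so the Information Imbalance should move toward 1, the value expected for unrelated representations. The single-nearest-neighbor estimator, the function names, and the toy data are assumptions for illustration.

```python
import numpy as np


def information_imbalance(A, B):
    """Delta(A -> B): 2/N times the mean rank in B of each point's nearest neighbor in A."""
    def sqdist(X):
        sq = np.sum(X * X, axis=1)
        d = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
        np.fill_diagonal(d, np.inf)  # exclude self-matches
        return d

    n = len(A)
    nearest_in_A = np.argmin(sqdist(A), axis=1)
    order_B = np.argsort(sqdist(B), axis=1, kind="stable")
    ranks_B = np.empty_like(order_B)
    ranks_B[np.arange(n)[:, None], order_B] = np.arange(1, n + 1)
    return 2.0 / n * ranks_B[np.arange(n), nearest_in_A].mean()


def shuffled_control(A, B, rng):
    """Imbalance after permuting the rows of B, which destroys the sample pairing."""
    perm = rng.permutation(len(B))
    return information_imbalance(A, B[perm])


# Synthetic paired data standing in for aligned translations.
rng = np.random.default_rng(0)
A = rng.normal(size=(300, 5))
B = A + 0.05 * rng.normal(size=A.shape)

aligned = information_imbalance(A, B)
control = shuffled_control(A, B, rng)
```

A gap between the aligned and shuffled values is what indicates that low imbalance reflects the pairing of samples rather than a generic property of the two spaces.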