pith. machine review for the scientific record. sign in

arxiv: 2604.18572 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.AI· cs.LG

Recognition: unknown

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords cross-modal alignmentrepresentational convergenceimage-text modelsnearest neighbor analysisdataset scalingmultimodal representationsevaluation sensitivity
0
0 comments X

The pith

Evidence for cross-modal neural network convergence weakens at large scales and realistic conditions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that support for the claim that image and text models converge to identical internal representations is fragile. Measures of alignment through mutual nearest neighbors hold up on small sets of roughly a thousand examples but drop sharply once scaled to millions of samples. What alignment remains captures only broad semantic categories rather than matching fine details across models. The one-to-one image-to-caption pairings used in earlier tests also fail to reflect realistic many-to-many data relationships and lower the measured overlap further. Newer language models do not continue the previously reported trend of increasing alignment with vision models.

Core claim

The experimental support for different modality models converging to identical representations relies on fragile evaluation setups. When alignment is measured using mutual nearest neighbors, it holds only on small datasets and breaks down at larger scales, revealing only coarse semantic similarities instead of fine-grained consistency. Additionally, the one-to-one image-caption constraint used in evaluations does not generalize to many-to-many realistic scenarios, and the trend of better language models aligning more with vision does not persist with recent models.

What carries the argument

Mutual nearest-neighbor overlap computed between image and text model embeddings on paired datasets, which serves as the metric for detecting representational convergence.

If this is right

  • Scaling the evaluation dataset to millions of samples causes substantial degradation in measured alignment.
  • Alignment that persists reflects only coarse semantic categories rather than consistent fine details.
  • The one-to-one pairing assumption in tests overestimates alignment compared to many-to-many settings.
  • Reported improvements in alignment with stronger language models do not hold for newer models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the claim holds, then combining modalities during training should prioritize complementary information over forcing identical representations.
  • This suggests developing metrics that capture fine-grained differences rather than relying solely on nearest-neighbor matches.
  • The findings could guide task-specific model selection where modality-unique features provide advantages.

Load-bearing premise

That the amount of mutual nearest-neighbor overlap between image and text representations on large datasets accurately reflects whether their fine-grained structures have converged.

What would settle it

Finding high and stable mutual nearest-neighbor overlap when scaling evaluations to millions of image-text pairs under many-to-many conditions would undermine the argument that prior evidence for convergence is fragile.

Figures

Figures reproduced from arXiv: 2604.18572 by Alexei A. Efros, A. Sophia Koepke, Daniil Zverev, Shiry Ginosar.

Figure 1
Figure 1. Figure 1: Illustration of the mutual nearest neighbor metric used by Huh et al. [ [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Nearest-neighbor quality depends on data density. We show 10 within-modality nearest neighbors for image (DINOv2) and text (LLM) embeddings on a sparse WIT-1024 gallery (top) and a denser WIT-1M gallery (bottom). For text queries, retrieved captions and their corresponding reference images are shown. At smaller scale, nearest neighbors are less semantically precise. Nearest￾neighbor structure becomes more … view at source ↗
Figure 3
Figure 3. Figure 3: Mutual kNN text-image feature alignment when scaling from WIT-1024 to WIT-1M. (a) shows the dependence on neighborhood size k, while (b) examines alignment for different LLMs. The observation from [40], that more capable language models align better with vision largely vanishes at WIT-1M scale. happens when it is relaxed. Finally, we perform a trend check to ask whether the predictions from [40] have held … view at source ↗
Figure 4
Figure 4. Figure 4: Scaling the gallery size to 1M (WIT) and 15M (LAION) shows a large drop in mutual [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Nearest-neighbor (k=1) examples with DINOv2 and OpenLlama across gallery scales on WIT-1M. Captions are shown with corresponding images. Mutual kNN matches across modalities are framed green. While the bottom example shows a match at 1M scale, at larger scales each model finds closer but different matches (top three). The mutual kNN alignment scores drop from 0.135 and 0.058 on the 1024-sample gallery to 0… view at source ↗
Figure 6
Figure 6. Figure 6: Nearest-neighbor (k=1) examples with DINOv2 and OpenLlama across gallery scales on LAION-15M. As the gallery densifies, each model finds closer but different matches (top example). The match at 15M (bottom right) is a near-duplicate that survived our deduplication pipeline. largely vanishes. The gap between LLMs narrows considerably, and the relationship between model capability and alignment weakens. This… view at source ↗
Figure 8
Figure 8. Figure 8: Decomposing cross-modal alignment on ImageNet val. (a) shows a qualitative retrieval [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Shared mistake at ipc=1. The query im￾age (bookstore) is matched by both DINOv2 and OpenLlama to a library image. The models agree, but on the wrong answer. The models are individually capable but orga￾nize within-class structure differently (Fig. 8a). At ipc=1, strict alignment (23.1%) actually ex￾ceeds the rate at which both models retrieve a correct-class neighbor (11.7%), meaning the models often agree… view at source ↗
Figure 9
Figure 9. Figure 9: Effect of relaxing the bijective assumption on text-image alignment, using the CycleReward [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Illustration of non-bijective (many￾to-many) correspondence between image and captions. The nearest neighbor of a text caption for one image (blue) is a caption for a different image (red). However, the nearest image neigh￾bor for a given image may be another image with the same caption. Specifically, images encode spatial, textural, and perceptual structure that text captures only to a limited extent. On… view at source ↗
Figure 11
Figure 11. Figure 11: Testing whether the alignment-LLM performance trend from [ [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Unimodal mutual kNN alignment as a function of gallery size on WIT-1M. In contrast to cross-modal alignment ( [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Cross-modal mutual kNN alignment on images recaptioned using gemini-3-flash-preview (WIT-1M-recap) as the gallery grows to 1M samples. Detailed captions result in overall higher mutual kNN scores, but do not prevent the drop in scores. (a) DINOv2-base and OpenLlama-13b (b) DINOv2-giant and OpenLlama-13b [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Cross-modal mutual kNN alignment as gallery grows from WIT-1024 to WIT-1M for additional, stronger model pairs. Replacing DINOv2-base with the stronger DINOv2-giant and OpenLlama-3b ( [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: ImageNet per-modality retrieval accuracy and cross-modal mutual [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Per-modality retrieval accuracy and cross-modal mutual [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Effect of relaxing the bijective assumption on mutual [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Mutual kNN alignment vs. language benchmark performance for 55 LLMs across four DINOv2 variants, on WikiText, HellaSwag, and GSM8K. Dashed lines show the linear trend fit to the 19 base models from [40]. For WikiText and HellaSwag (top two plots), recent models roughly follow the trend. For GSM8K (bottom plot), the trend is not followed. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Mutual kNN alignment vs. language benchmark performance for 55 LLMs across four DINOv2 variants, on ARC, LogiQA2, and MMLU. As with GSM8K, the alignment-performance trend from [40] does not extrapolate to recent models on any of these reasoning benchmarks. Stronger models do not appear to show higher mutual kNN alignment with DINOv2 features. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Generated image captions for the ImageNet validation set. a) shows the mutual [PITH_FULL_IMAGE:figures/full_fig_p028_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Additional nearest-neighbor examples with DINOv2 and OpenLlama-3b for [PITH_FULL_IMAGE:figures/full_fig_p030_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Additional nearest-neighbor examples with DINOv2 and OpenLlama-3b for [PITH_FULL_IMAGE:figures/full_fig_p031_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Additional nearest-neighbor examples with DINOv2 and OpenLlama-3b for [PITH_FULL_IMAGE:figures/full_fig_p032_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Additional nearest-neighbor examples with DINOv2 and OpenLlama-3b for [PITH_FULL_IMAGE:figures/full_fig_p033_24.png] view at source ↗
read the original abstract

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper challenges the Platonic Representation Hypothesis by re-evaluating cross-modal alignment (via mutual nearest-neighbor overlap) on scaled datasets up to millions of samples and in many-to-many image-text regimes. It claims that alignment degrades substantially from the ~1K-sample regime used in prior work, that remaining overlap reflects only coarse semantics rather than fine-grained structure, that one-to-one caption constraints artificially inflate apparent convergence, and that the trend of stronger language models aligning better with vision models fails to hold for newer models. Overall, the authors conclude that evidence for representational convergence is considerably weaker than subsequent literature has assumed.

Significance. If the central claims hold after addressing the metric calibration issues, the work would usefully temper enthusiasm for the Platonic hypothesis and highlight the sensitivity of alignment conclusions to evaluation scale and correspondence assumptions. The manuscript earns credit for performing systematic scaling experiments and for testing the robustness of prior one-to-one findings in more realistic many-to-many settings.

major comments (3)
  1. [§4 (Scaling Experiments)] §4 (Scaling Experiments): The claim that low mutual NN overlap at 1M+ samples demonstrates absence of fine-grained convergence is load-bearing, yet the metric is not calibrated with a positive control. No comparison is reported between mutual NN rates for two same-modality models known to share detailed structure (e.g., independently trained ViTs on identical images) versus cross-modal pairs. Without this, degradation could arise from density effects or metric saturation rather than non-convergence.
  2. [§3.3 (Many-to-Many Regime)] §3.3 (Many-to-Many Regime): The reduction in alignment when moving from one-to-one to many-to-many pairings is presented as further evidence of fragility. However, the expected mutual NN overlap under partial fine-grained alignment is neither modeled nor quantified, leaving the magnitude of the observed drop difficult to interpret.
  3. [Results on LM Scaling Trends] Results on LM Scaling Trends: The assertion that the previously reported trend of stronger language models aligning more closely with vision models does not hold for newer models is central to the critique of subsequent literature. This requires explicit listing of the newer models, exact evaluation protocol, and statistical significance tests to support the conclusion.
minor comments (2)
  1. The abstract and introduction should explicitly cite the original Platonic Representation Hypothesis paper and the specific claims being re-evaluated for reader orientation.
  2. Figure captions and axis labels in the scaling plots would benefit from clearer indication of sample sizes and confidence intervals to aid interpretation of the degradation trend.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight valuable opportunities to strengthen the calibration and interpretability of our results. We have revised the manuscript to incorporate positive controls, quantitative modeling of expected overlaps, and expanded documentation of the LM scaling experiments, as detailed below.

read point-by-point responses
  1. Referee: §4 (Scaling Experiments): The claim that low mutual NN overlap at 1M+ samples demonstrates absence of fine-grained convergence is load-bearing, yet the metric is not calibrated with a positive control. No comparison is reported between mutual NN rates for two same-modality models known to share detailed structure (e.g., independently trained ViTs on identical images) versus cross-modal pairs. Without this, degradation could arise from density effects or metric saturation rather than non-convergence.

    Authors: We agree that a same-modality positive control is necessary to calibrate the metric and rule out density or saturation artifacts. In the revised manuscript we have added this experiment to §4: we compute mutual NN overlap between two independently trained ViT-B/16 models on the identical 1M-image subset and obtain overlap rates of 42–48% (well above the <5% cross-modal rates). This control confirms that the metric remains sensitive to fine-grained structure at scale when such structure exists, supporting our interpretation of the cross-modal results. revision: yes

  2. Referee: §3.3 (Many-to-Many Regime): The reduction in alignment when moving from one-to-one to many-to-many pairings is presented as further evidence of fragility. However, the expected mutual NN overlap under partial fine-grained alignment is neither modeled nor quantified, leaving the magnitude of the observed drop difficult to interpret.

    Authors: We have addressed this by adding a probabilistic simulation in the revised §3.3. We generate synthetic embedding pairs with tunable correlation levels (0.2–0.6) to represent partial fine-grained alignment and compute expected mutual NN rates under the same many-to-many sampling procedure used in the paper. The simulations show that even moderate partial alignment would produce mutual NN overlap 2–3× higher than the observed drop, indicating that the empirical reduction cannot be explained by partial alignment alone. revision: yes

  3. Referee: Results on LM Scaling Trends: The assertion that the previously reported trend of stronger language models aligning more closely with vision models does not hold for newer models is central to the critique of subsequent literature. This requires explicit listing of the newer models, exact evaluation protocol, and statistical significance tests to support the conclusion.

    Authors: We have expanded the relevant results section with an explicit table of all evaluated language models (including Llama-3-8B, Mistral-7B, Gemma-2B, and Phi-3), the precise protocol (mutual NN on the 1M-sample set, 5 random seeds, fixed vision backbone), and bootstrap 95% confidence intervals together with paired t-tests. The tests confirm that the reversal for newer models is statistically significant (p < 0.01) relative to the earlier scaling trend. revision: yes

Circularity Check

0 steps flagged

No significant circularity; independent empirical re-evaluation

full rationale

The paper's claims are grounded in fresh experiments that scale mutual nearest-neighbor overlap measurements to millions of samples and switch to many-to-many correspondence regimes. These are direct, independent observations on new data rather than quantities defined by, fitted to, or renamed from the original Platonic hypothesis. No load-bearing steps reduce to self-citations, self-definitions, or ansatzes imported from the authors' prior work; the critique proceeds by altering the evaluation regime and reporting the resulting degradation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of mutual nearest-neighbor overlap as a proxy for representational convergence and on the assumption that the chosen large-scale datasets preserve the same semantic structure as the original small sets.

axioms (1)
  • domain assumption Mutual nearest neighbors computed on embeddings is a reliable measure of fine-grained representational alignment
    Invoked when interpreting the drop in alignment scores as evidence against convergence.

pith-pipeline@v0.9.0 · 5512 in / 1156 out tokens · 43532 ms · 2026-05-10T04:56:24.545900+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

100 extracted references · 34 canonical work pages · 23 internal anchors

  1. [1]

    M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakr- ishnan, K. Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  2. [2]

    The surprising effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279,

    E. Akyürek, M. Damani, A. Zweiger, L. Qiu, H. Guo, J. Pari, Y . Kim, and J. Andreas. The surpris- ing effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279, 2024

  3. [3]

    Cycle consistency as reward: Learning image- text alignment without human preferences.arXiv preprint arXiv:2506.02095, 2025

    H. Bahng, C. Chan, F. Durand, and P. Isola. Cycle consistency as reward: Learning image-text alignment without human preferences.arXiv preprint arXiv:2506.02095, 2025

  4. [4]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    Balestriero et al

    R. Balestriero et al. A spline theory of deep learning. InICML, 2018

  6. [6]

    Bansal, P

    Y . Bansal, P. Nakkiran, and B. Barak. Revisiting model stitching to compare neural representa- tions. InNeurIPS, 2021

  7. [7]

    E. M. Bender and A. Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

  8. [8]

    Browning and Y

    J. Browning and Y . LeCun. Ai and the limits of language.Noema Magazine, 2022

  9. [9]

    W. Chai, E. Song, Y . Du, C. Meng, V . Madhavan, O. Bar-Tal, J.-N. Hwang, S. Xie, and C. D. Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. In ICLR, 2025

  10. [10]

    F. Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

  11. [11]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  12. [12]

    Cover and P

    T. Cover and P. Hart. Nearest neighbor pattern classification.IEEE transactions on information theory, 1967

  13. [13]

    G. Dar. mini-vec2vec: Scaling universal geometry alignment with linear transformations.arXiv preprint arXiv:2510.02348, 2025. 12

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, D. Guo, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  15. [15]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009

  16. [16]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021

  17. [17]

    The Faiss library

    M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou. The faiss library.arXiv preprint arXiv:2401.08281, 2024

  18. [18]

    Dravid, Y

    A. Dravid, Y . Gandelsman, A. A. Efros, and A. Shocher. Rosetta neurons: Mining the common units in a model zoo. InICCV, 2023

  19. [19]

    S. Edelman. Representation is representation of similarities.Behavioral and brain sciences, 1998

  20. [20]

    L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. The language model evaluation harness.Zenodo, 07 2024. doi: 10.5281/zenodo.12608602. URLhttp...

  21. [21]

    G. D. Gemma Team. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

  22. [22]

    G. D. Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  23. [23]

    G. D. Gemma Team. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  24. [24]

    Geng and H

    X. Geng and H. Liu. Openllama: An open reproduction of llama, 2023. URLhttps://github. com/openlm-research/open_llama

  25. [25]

    J. J. Gibson.The Ecological Approach to Visual Perception. Houghton Mifflin, Boston, 1979. ISBN 978-0898593019

  26. [26]

    Gokaslan and V

    A. Gokaslan and V . Cohen. Openwebtext corpus. http://Skylion007.github.io/ OpenWebTextCorpus, 2019

  27. [27]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  28. [28]

    Groeneveld, I

    D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y . Wang, S. Arora, D. Atkinson, R. Authur, K. R. Chandu, A. Cohan, J. Dumas, Y . Elazar, Y . Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V . Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, ...

  29. [29]

    Gröger, S

    F. Gröger, S. Wen, and M. Brbi´c. Revisiting the platonic representation hypothesis: An aris- totelian view.arXiv preprint arXiv:2602.14486, 2026

  30. [30]

    S. Gu, C. Clark, and A. Kembhavi. I can’t believe there’s no images! learning visual tasks using only language supervision. InICCV, 2023

  31. [31]

    Gupta, S

    S. Gupta, S. Kansal, S. Jegelka, P. Isola, and V . Garg. Canonicalizing multimodal contrastive representation learning. InICLR, 2026. 13

  32. [32]

    Hadgi, L

    S. Hadgi, L. Moschella, A. Santilli, D. Gomez, Q. Huang, E. Rodolà, S. Melzi, and M. Ovs- janikov. Escaping plato’s cave: Towards the alignment of 3d and text latent spaces. InCVPR, 2025

  33. [33]

    J. V . Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex.Science, 2001

  34. [34]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. InICLR, 2021

  35. [35]

    W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

  36. [36]

    Hotelling

    H. Hotelling. Relations between two sets of variates. InBreakthroughs in statistics: methodology and distribution. 1992

  37. [37]

    X. Hu, S. Storks, R. L. Lewis, and J. Chai. In-context analogical reasoning with pre-trained language models. InACL, 2023

  38. [38]

    Y . Hu, H. Hua, Z. Yang, W. Shi, N. A. Smith, and J. Luo. Promptcap: Prompt-guided image captioning for vqa with gpt-3. InICCV, 2023

  39. [39]

    Huang, L

    S. Huang, L. Dong, W. Wang, Y . Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V . Chaudhary, S. Som, X. Song, and F. Wei. Language is not all you need: Aligning perception with language models. InNeurIPS, 2023

  40. [40]

    M. Huh, B. Cheung, T. Wang, and P. Isola. The platonic representation hypothesis. InICML, 2024

  41. [41]

    P. Isola. Personal communication, 2025

  42. [42]

    R. Jha, C. Zhang, V . Shmatikov, and J. X. Morris. Harnessing the universal geometry of embeddings. InNeurIPS, 2025

  43. [43]

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

  44. [44]

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  45. [45]

    Tracing representation progression: Analyzing and enhancing layer-wise similarity.arXiv preprint arXiv:2406.14479, 2024

    J. Jiang, J. Zhou, and Z. Zhu. Tracing representation progression: Analyzing and enhancing layer-wise similarity.arXiv preprint arXiv:2406.14479, 2024

  46. [46]

    J. J. Koenderink.Sentience. De Clootcrans Press, Trajectum, Netherlands, 2019

  47. [47]

    Kornblith, M

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. InICML, 2019

  48. [48]

    Kriegeskorte, M

    N. Kriegeskorte, M. Mur, and P. A. Bandettini. Representational similarity analysis-connecting the branches of systems neuroscience.Frontiers in systems neuroscience, 2008

  49. [49]

    Krishna, Y

    R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.IJCV, 2017

  50. [50]

    Kumar, J

    A. Kumar, J. Clune, J. Lehman, and K. O. Stanley. Questioning representational opti- mism in deep learning: The fractured entangled representation hypothesis.arXiv preprint arXiv:2505.11581, 2025

  51. [51]

    LeCun et al

    Y . LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Openreview, 2022. 14

  52. [52]

    Lenc and A

    K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. InCVPR, 2015

  53. [53]

    Y . Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft. Convergent learning: Do different neural networks learn the same representations? InICLR, 2016

  54. [54]

    Liang, W

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. InICRA, 2023

  55. [55]

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. InECCV, 2014

  56. [56]

    A. H. Liu, S. Subramanian, V . Jouault, A. Sadé, et al. Ministral 3.arXiv preprint arXiv:2601.08584, 2026

  57. [57]

    D. Liu, S. Zhao, L. Zhuo, W. Lin, Y . Xin, X. Li, Q. Qin, Y . Qiao, H. Li, and P. Gao. Lumina- mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2025

  58. [58]

    H. Liu, J. Liu, L. Cui, Z. Teng, N. Duan, M. Zhou, and Y . Zhang. Logiqa2.0: The logicqa dataset for logical reasoning.IEEE Transactions on Audio, Speech, and Language Processing, 2023

  59. [59]

    Maniparambil, R

    M. Maniparambil, R. Akshulakov, Y . A. D. Djilali, S. Narayan, M. E. A. Seddik, K. Mangalam, and N. E. O’Connor. Do vision and language encoders represent the world similarly? InCVPR, 2024

  60. [60]

    Marcos-Manchón and L

    P. Marcos-Manchón and L. Fuentemilla. Shared representations in brains and models reveal a two-route cortical organization during scene perception.arXiv preprint arXiv:2507.13941, 2026

  61. [61]

    Pointer Sentinel Mixture Models

    S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016

  62. [62]

    Merullo, L

    J. Merullo, L. Castricato, C. Eickhoff, and E. Pavlick. Linearly mapping from image to text space. InICLR, 2023

  63. [63]

    The llama 4 herd: The beginning of a new era of natively multimodal ai innovation,

    Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation,

  64. [64]

    URLhttps://ai.meta.com/blog/llama-4-multimodal-intelligence/

  65. [65]

    A. S. Morcos, M. Raghu, and S. Bengio. Insights on representational similarity in neural networks with canonical correlation. InNeurIPS, 2018

  66. [66]

    Moschella, V

    L. Moschella, V . Maiorca, M. Fumero, A. Norelli, F. Locatello, and E. Rodolà. Relative representations enable zero-shot latent space communication. InICLR, 2023

  67. [67]

    Muennighoff, T

    N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, and C. Raffel. Crosslingual generalization through multitask finetuning. In ACL, 2023

  68. [68]

    T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y . Gu, S. Huang, M. Jordan, N. Lambert, D. Schwenk, O. Tafjord, T. Anderson, D. Atkinson, F. Brahman, C. Clark, P. Dasigi, N. Dziri, A. Ettinger, M. Guerquin, D. Heineman, H. Ivison, P. W. Koh, J. Liu, S. Malik, W. Merrill, L. J. V . Miranda, J. Morrison, T. Murray, C. Nam, J. Poz...

  69. [69]

    Introducing gpt-oss, 2025

    OpenAI. Introducing gpt-oss, 2025. URL https://openai.com/index/ introducing-gpt-oss/

  70. [70]

    Oquab, T

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without super...

  71. [71]

    Pichai, D

    S. Pichai, D. Hassabis, and K. Kavukcuoglu. A new era of intelligence with Gem- ini 3. Google Blog (The Keyword), Nov. 2025. URL https://blog.google/ products-and-platforms/products/gemini/gemini-3/. Accessed: 2026-01-01

  72. [72]

    Republic

    Plato. Republic. c. 375 BC

  73. [73]

    A. C. Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  74. [74]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InICML, 2021

  75. [75]

    Research

    I. Research. Granite 3.3 8b base, 2025. URL https://huggingface.co/ibm-granite/ granite-3.3-8b-base

  76. [76]

    E. Rosch. Principles of categorization. In E. Rosch and B. B. Lloyd, editors,Cognition and Categorization, pages 27–48. Lawrence Elbaum Associates, 1978

  77. [77]

    J. Ruan, A. Abudula, X. Liu, B. Li, Y . Li, C. Wang, Y . Fan, Y . Ge, T. Xiao, and J. Zhu. Ndp: Next distribution prediction as a more broad target.arXiv preprint arXiv:2408.17377, 2024

  78. [78]

    Schnaus, N

    D. Schnaus, N. Araslanov, and D. Cremers. It’s a (blind) match! towards vision-language correspondence without parallel data. InCVPR, 2025

  79. [79]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs.arXiv preprint arXiv:2111.02114, 2021

  80. [80]

    Singh, R

    A. Singh, R. Hu, V . Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela. Public multimodal dataset (PMD). URLhttps://huggingface.co/datasets/facebook/pmd

Showing first 80 references.