arxiv: 2604.18572 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.AI· cs.LG

Recognition: unknown

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

A. Sophia Koepke , Daniil Zverev , Shiry Ginosar , Alexei A. Efros

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords cross-modal alignmentrepresentational convergenceimage-text modelsnearest neighbor analysisdataset scalingmultimodal representationsevaluation sensitivity

0 comments

The pith

Evidence for cross-modal neural network convergence weakens at large scales and realistic conditions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that support for the claim that image and text models converge to identical internal representations is fragile. Measures of alignment through mutual nearest neighbors hold up on small sets of roughly a thousand examples but drop sharply once scaled to millions of samples. What alignment remains captures only broad semantic categories rather than matching fine details across models. The one-to-one image-to-caption pairings used in earlier tests also fail to reflect realistic many-to-many data relationships and lower the measured overlap further. Newer language models do not continue the previously reported trend of increasing alignment with vision models.

Core claim

The experimental support for different modality models converging to identical representations relies on fragile evaluation setups. When alignment is measured using mutual nearest neighbors, it holds only on small datasets and breaks down at larger scales, revealing only coarse semantic similarities instead of fine-grained consistency. Additionally, the one-to-one image-caption constraint used in evaluations does not generalize to many-to-many realistic scenarios, and the trend of better language models aligning more with vision does not persist with recent models.

What carries the argument

Mutual nearest-neighbor overlap computed between image and text model embeddings on paired datasets, which serves as the metric for detecting representational convergence.

If this is right

Scaling the evaluation dataset to millions of samples causes substantial degradation in measured alignment.
Alignment that persists reflects only coarse semantic categories rather than consistent fine details.
The one-to-one pairing assumption in tests overestimates alignment compared to many-to-many settings.
Reported improvements in alignment with stronger language models do not hold for newer models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the claim holds, then combining modalities during training should prioritize complementary information over forcing identical representations.
This suggests developing metrics that capture fine-grained differences rather than relying solely on nearest-neighbor matches.
The findings could guide task-specific model selection where modality-unique features provide advantages.

Load-bearing premise

That the amount of mutual nearest-neighbor overlap between image and text representations on large datasets accurately reflects whether their fine-grained structures have converged.

What would settle it

Finding high and stable mutual nearest-neighbor overlap when scaling evaluations to millions of image-text pairs under many-to-many conditions would undermine the argument that prior evidence for convergence is fragile.

Figures

Figures reproduced from arXiv: 2604.18572 by Alexei A. Efros, A. Sophia Koepke, Daniil Zverev, Shiry Ginosar.

**Figure 2.** Figure 2: Nearest-neighbor quality depends on data density. We show 10 within-modality nearest neighbors for image (DINOv2) and text (LLM) embeddings on a sparse WIT-1024 gallery (top) and a denser WIT-1M gallery (bottom). For text queries, retrieved captions and their corresponding reference images are shown. At smaller scale, nearest neighbors are less semantically precise. Nearestneighbor structure becomes more … view at source ↗

**Figure 3.** Figure 3: Mutual kNN text-image feature alignment when scaling from WIT-1024 to WIT-1M. (a) shows the dependence on neighborhood size k, while (b) examines alignment for different LLMs. The observation from [40], that more capable language models align better with vision largely vanishes at WIT-1M scale. happens when it is relaxed. Finally, we perform a trend check to ask whether the predictions from [40] have held … view at source ↗

**Figure 4.** Figure 4: Scaling the gallery size to 1M (WIT) and 15M (LAION) shows a large drop in mutual [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Nearest-neighbor (k=1) examples with DINOv2 and OpenLlama across gallery scales on WIT-1M. Captions are shown with corresponding images. Mutual kNN matches across modalities are framed green. While the bottom example shows a match at 1M scale, at larger scales each model finds closer but different matches (top three). The mutual kNN alignment scores drop from 0.135 and 0.058 on the 1024-sample gallery to 0… view at source ↗

**Figure 6.** Figure 6: Nearest-neighbor (k=1) examples with DINOv2 and OpenLlama across gallery scales on LAION-15M. As the gallery densifies, each model finds closer but different matches (top example). The match at 15M (bottom right) is a near-duplicate that survived our deduplication pipeline. largely vanishes. The gap between LLMs narrows considerably, and the relationship between model capability and alignment weakens. This… view at source ↗

**Figure 8.** Figure 8: Decomposing cross-modal alignment on ImageNet val. (a) shows a qualitative retrieval [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 7.** Figure 7: Shared mistake at ipc=1. The query image (bookstore) is matched by both DINOv2 and OpenLlama to a library image. The models agree, but on the wrong answer. The models are individually capable but organize within-class structure differently (Fig. 8a). At ipc=1, strict alignment (23.1%) actually exceeds the rate at which both models retrieve a correct-class neighbor (11.7%), meaning the models often agree… view at source ↗

**Figure 9.** Figure 9: Effect of relaxing the bijective assumption on text-image alignment, using the CycleReward [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 10.** Figure 10: Illustration of non-bijective (manyto-many) correspondence between image and captions. The nearest neighbor of a text caption for one image (blue) is a caption for a different image (red). However, the nearest image neighbor for a given image may be another image with the same caption. Specifically, images encode spatial, textural, and perceptual structure that text captures only to a limited extent. On… view at source ↗

**Figure 11.** Figure 11: Testing whether the alignment-LLM performance trend from [ [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗

**Figure 12.** Figure 12: Unimodal mutual kNN alignment as a function of gallery size on WIT-1M. In contrast to cross-modal alignment ( [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Cross-modal mutual kNN alignment on images recaptioned using gemini-3-flash-preview (WIT-1M-recap) as the gallery grows to 1M samples. Detailed captions result in overall higher mutual kNN scores, but do not prevent the drop in scores. (a) DINOv2-base and OpenLlama-13b (b) DINOv2-giant and OpenLlama-13b [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

**Figure 14.** Figure 14: Cross-modal mutual kNN alignment as gallery grows from WIT-1024 to WIT-1M for additional, stronger model pairs. Replacing DINOv2-base with the stronger DINOv2-giant and OpenLlama-3b ( [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗

**Figure 15.** Figure 15: ImageNet per-modality retrieval accuracy and cross-modal mutual [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗

**Figure 16.** Figure 16: Per-modality retrieval accuracy and cross-modal mutual [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗

**Figure 17.** Figure 17: Effect of relaxing the bijective assumption on mutual [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗

**Figure 18.** Figure 18: Mutual kNN alignment vs. language benchmark performance for 55 LLMs across four DINOv2 variants, on WikiText, HellaSwag, and GSM8K. Dashed lines show the linear trend fit to the 19 base models from [40]. For WikiText and HellaSwag (top two plots), recent models roughly follow the trend. For GSM8K (bottom plot), the trend is not followed. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗

**Figure 19.** Figure 19: Mutual kNN alignment vs. language benchmark performance for 55 LLMs across four DINOv2 variants, on ARC, LogiQA2, and MMLU. As with GSM8K, the alignment-performance trend from [40] does not extrapolate to recent models on any of these reasoning benchmarks. Stronger models do not appear to show higher mutual kNN alignment with DINOv2 features. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_19.png] view at source ↗

**Figure 20.** Figure 20: Generated image captions for the ImageNet validation set. a) shows the mutual [PITH_FULL_IMAGE:figures/full_fig_p028_20.png] view at source ↗

**Figure 21.** Figure 21: Additional nearest-neighbor examples with DINOv2 and OpenLlama-3b for [PITH_FULL_IMAGE:figures/full_fig_p030_21.png] view at source ↗

**Figure 22.** Figure 22: Additional nearest-neighbor examples with DINOv2 and OpenLlama-3b for [PITH_FULL_IMAGE:figures/full_fig_p031_22.png] view at source ↗

**Figure 23.** Figure 23: Additional nearest-neighbor examples with DINOv2 and OpenLlama-3b for [PITH_FULL_IMAGE:figures/full_fig_p032_23.png] view at source ↗

**Figure 24.** Figure 24: Additional nearest-neighbor examples with DINOv2 and OpenLlama-3b for [PITH_FULL_IMAGE:figures/full_fig_p033_24.png] view at source ↗

read the original abstract

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Scaling the tests shows cross-modal alignment drops sharply and the original evidence looks regime-dependent, though the metric still needs calibration checks.

read the letter

The main takeaway is that the support for the Platonic hypothesis weakens when you scale the evaluation to millions of samples and drop the one-to-one constraint. Alignment measured by mutual nearest neighbors degrades, the leftover overlap stays at coarse semantics, and the trend of stronger language models aligning better does not hold for newer models. These are the new observations the paper adds over the original work. Running the checks independently on larger data and in many-to-many settings is the useful part, because it shows how much the earlier conclusions depended on the small-scale, paired setup. The directional findings on degradation read as solid from the reported patterns. The softer spot is the leap from low overlap to absent fine-grained convergence. Mutual nearest neighbors can saturate or lose sensitivity in dense, large-scale spaces, and the paper does not include positive controls such as same-modality vision models known to share detailed structure. Without those, or a baseline for expected overlap in the many-to-many regime, the drop could partly reflect the metric rather than true non-convergence. This paper is for people working on multimodal representations who want to test whether shared spaces are automatic. The scaling results and the regime-sensitivity point are worth a reading group discussion. It has enough new empirical content and clear engagement with the prior literature to deserve a serious referee, even if the authors should add calibration experiments to tighten the interpretation.

Referee Report

3 major / 2 minor

Summary. The paper challenges the Platonic Representation Hypothesis by re-evaluating cross-modal alignment (via mutual nearest-neighbor overlap) on scaled datasets up to millions of samples and in many-to-many image-text regimes. It claims that alignment degrades substantially from the ~1K-sample regime used in prior work, that remaining overlap reflects only coarse semantics rather than fine-grained structure, that one-to-one caption constraints artificially inflate apparent convergence, and that the trend of stronger language models aligning better with vision models fails to hold for newer models. Overall, the authors conclude that evidence for representational convergence is considerably weaker than subsequent literature has assumed.

Significance. If the central claims hold after addressing the metric calibration issues, the work would usefully temper enthusiasm for the Platonic hypothesis and highlight the sensitivity of alignment conclusions to evaluation scale and correspondence assumptions. The manuscript earns credit for performing systematic scaling experiments and for testing the robustness of prior one-to-one findings in more realistic many-to-many settings.

major comments (3)

[§4 (Scaling Experiments)] §4 (Scaling Experiments): The claim that low mutual NN overlap at 1M+ samples demonstrates absence of fine-grained convergence is load-bearing, yet the metric is not calibrated with a positive control. No comparison is reported between mutual NN rates for two same-modality models known to share detailed structure (e.g., independently trained ViTs on identical images) versus cross-modal pairs. Without this, degradation could arise from density effects or metric saturation rather than non-convergence.
[§3.3 (Many-to-Many Regime)] §3.3 (Many-to-Many Regime): The reduction in alignment when moving from one-to-one to many-to-many pairings is presented as further evidence of fragility. However, the expected mutual NN overlap under partial fine-grained alignment is neither modeled nor quantified, leaving the magnitude of the observed drop difficult to interpret.
[Results on LM Scaling Trends] Results on LM Scaling Trends: The assertion that the previously reported trend of stronger language models aligning more closely with vision models does not hold for newer models is central to the critique of subsequent literature. This requires explicit listing of the newer models, exact evaluation protocol, and statistical significance tests to support the conclusion.

minor comments (2)

The abstract and introduction should explicitly cite the original Platonic Representation Hypothesis paper and the specific claims being re-evaluated for reader orientation.
Figure captions and axis labels in the scaling plots would benefit from clearer indication of sample sizes and confidence intervals to aid interpretation of the degradation trend.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight valuable opportunities to strengthen the calibration and interpretability of our results. We have revised the manuscript to incorporate positive controls, quantitative modeling of expected overlaps, and expanded documentation of the LM scaling experiments, as detailed below.

read point-by-point responses

Referee: §4 (Scaling Experiments): The claim that low mutual NN overlap at 1M+ samples demonstrates absence of fine-grained convergence is load-bearing, yet the metric is not calibrated with a positive control. No comparison is reported between mutual NN rates for two same-modality models known to share detailed structure (e.g., independently trained ViTs on identical images) versus cross-modal pairs. Without this, degradation could arise from density effects or metric saturation rather than non-convergence.

Authors: We agree that a same-modality positive control is necessary to calibrate the metric and rule out density or saturation artifacts. In the revised manuscript we have added this experiment to §4: we compute mutual NN overlap between two independently trained ViT-B/16 models on the identical 1M-image subset and obtain overlap rates of 42–48% (well above the <5% cross-modal rates). This control confirms that the metric remains sensitive to fine-grained structure at scale when such structure exists, supporting our interpretation of the cross-modal results. revision: yes
Referee: §3.3 (Many-to-Many Regime): The reduction in alignment when moving from one-to-one to many-to-many pairings is presented as further evidence of fragility. However, the expected mutual NN overlap under partial fine-grained alignment is neither modeled nor quantified, leaving the magnitude of the observed drop difficult to interpret.

Authors: We have addressed this by adding a probabilistic simulation in the revised §3.3. We generate synthetic embedding pairs with tunable correlation levels (0.2–0.6) to represent partial fine-grained alignment and compute expected mutual NN rates under the same many-to-many sampling procedure used in the paper. The simulations show that even moderate partial alignment would produce mutual NN overlap 2–3× higher than the observed drop, indicating that the empirical reduction cannot be explained by partial alignment alone. revision: yes
Referee: Results on LM Scaling Trends: The assertion that the previously reported trend of stronger language models aligning more closely with vision models does not hold for newer models is central to the critique of subsequent literature. This requires explicit listing of the newer models, exact evaluation protocol, and statistical significance tests to support the conclusion.

Authors: We have expanded the relevant results section with an explicit table of all evaluated language models (including Llama-3-8B, Mistral-7B, Gemma-2B, and Phi-3), the precise protocol (mutual NN on the 1M-sample set, 5 random seeds, fixed vision backbone), and bootstrap 95% confidence intervals together with paired t-tests. The tests confirm that the reversal for newer models is statistically significant (p < 0.01) relative to the earlier scaling trend. revision: yes

Circularity Check

0 steps flagged

No significant circularity; independent empirical re-evaluation

full rationale

The paper's claims are grounded in fresh experiments that scale mutual nearest-neighbor overlap measurements to millions of samples and switch to many-to-many correspondence regimes. These are direct, independent observations on new data rather than quantities defined by, fitted to, or renamed from the original Platonic hypothesis. No load-bearing steps reduce to self-citations, self-definitions, or ansatzes imported from the authors' prior work; the critique proceeds by altering the evaluation regime and reporting the resulting degradation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of mutual nearest-neighbor overlap as a proxy for representational convergence and on the assumption that the chosen large-scale datasets preserve the same semantic structure as the original small sets.

axioms (1)

domain assumption Mutual nearest neighbors computed on embeddings is a reliable measure of fine-grained representational alignment
Invoked when interpreting the drop in alignment scores as evidence against convergence.

pith-pipeline@v0.9.0 · 5512 in / 1156 out tokens · 43532 ms · 2026-05-10T04:56:24.545900+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

100 extracted references · 34 canonical work pages · 23 internal anchors

[1]

M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakr- ishnan, K. Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review arXiv 2022
[2]

The surprising effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279,

E. Akyürek, M. Damani, A. Zweiger, L. Qiu, H. Guo, J. Pari, Y . Kim, and J. Andreas. The surpris- ing effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279, 2024

work page arXiv 2024
[3]

Cycle consistency as reward: Learning image- text alignment without human preferences.arXiv preprint arXiv:2506.02095, 2025

H. Bahng, C. Chan, F. Durand, and P. Isola. Cycle consistency as reward: Learning image-text alignment without human preferences.arXiv preprint arXiv:2506.02095, 2025

work page arXiv 2025
[4]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Balestriero et al

R. Balestriero et al. A spline theory of deep learning. InICML, 2018

2018
[6]

Bansal, P

Y . Bansal, P. Nakkiran, and B. Barak. Revisiting model stitching to compare neural representa- tions. InNeurIPS, 2021

2021
[7]

E. M. Bender and A. Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

2020
[8]

Browning and Y

J. Browning and Y . LeCun. Ai and the limits of language.Noema Magazine, 2022

2022
[9]

W. Chai, E. Song, Y . Du, C. Meng, V . Madhavan, O. Bar-Tal, J.-N. Hwang, S. Xie, and C. D. Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. In ICLR, 2025

2025
[10]

F. Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

work page internal anchor Pith review arXiv 1911
[11]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

Cover and P

T. Cover and P. Hart. Nearest neighbor pattern classification.IEEE transactions on information theory, 1967

1967
[13]

G. Dar. mini-vec2vec: Scaling universal geometry alignment with linear transformations.arXiv preprint arXiv:2510.02348, 2025. 12

work page arXiv 2025
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, D. Guo, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009

2009
[16]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021

2021
[17]

The Faiss library

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou. The faiss library.arXiv preprint arXiv:2401.08281, 2024

work page internal anchor Pith review arXiv 2024
[18]

Dravid, Y

A. Dravid, Y . Gandelsman, A. A. Efros, and A. Shocher. Rosetta neurons: Mining the common units in a model zoo. InICCV, 2023

2023
[19]

S. Edelman. Representation is representation of similarities.Behavioral and brain sciences, 1998

1998
[20]

L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. The language model evaluation harness.Zenodo, 07 2024. doi: 10.5281/zenodo.12608602. URLhttp...

work page doi:10.5281/zenodo.12608602 2024
[21]

G. D. Gemma Team. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

work page internal anchor Pith review arXiv 2024
[22]

G. D. Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

work page internal anchor Pith review arXiv 2024
[23]

G. D. Gemma Team. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Geng and H

X. Geng and H. Liu. Openllama: An open reproduction of llama, 2023. URLhttps://github. com/openlm-research/open_llama

2023
[25]

J. J. Gibson.The Ecological Approach to Visual Perception. Houghton Mifflin, Boston, 1979. ISBN 978-0898593019

1979
[26]

Gokaslan and V

A. Gokaslan and V . Cohen. Openwebtext corpus. http://Skylion007.github.io/ OpenWebTextCorpus, 2019

2019
[27]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Groeneveld, I

D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y . Wang, S. Arora, D. Atkinson, R. Authur, K. R. Chandu, A. Cohan, J. Dumas, Y . Elazar, Y . Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V . Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, ...

2024
[29]

Gröger, S

F. Gröger, S. Wen, and M. Brbi´c. Revisiting the platonic representation hypothesis: An aris- totelian view.arXiv preprint arXiv:2602.14486, 2026

work page arXiv 2026
[30]

S. Gu, C. Clark, and A. Kembhavi. I can’t believe there’s no images! learning visual tasks using only language supervision. InICCV, 2023

2023
[31]

Gupta, S

S. Gupta, S. Kansal, S. Jegelka, P. Isola, and V . Garg. Canonicalizing multimodal contrastive representation learning. InICLR, 2026. 13

2026
[32]

Hadgi, L

S. Hadgi, L. Moschella, A. Santilli, D. Gomez, Q. Huang, E. Rodolà, S. Melzi, and M. Ovs- janikov. Escaping plato’s cave: Towards the alignment of 3d and text latent spaces. InCVPR, 2025

2025
[33]

J. V . Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex.Science, 2001

2001
[34]

Hendrycks, C

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. InICLR, 2021

2021
[35]

W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review arXiv 2025
[36]

Hotelling

H. Hotelling. Relations between two sets of variates. InBreakthroughs in statistics: methodology and distribution. 1992

1992
[37]

X. Hu, S. Storks, R. L. Lewis, and J. Chai. In-context analogical reasoning with pre-trained language models. InACL, 2023

2023
[38]

Y . Hu, H. Hua, Z. Yang, W. Shi, N. A. Smith, and J. Luo. Promptcap: Prompt-guided image captioning for vqa with gpt-3. InICCV, 2023

2023
[39]

Huang, L

S. Huang, L. Dong, W. Wang, Y . Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V . Chaudhary, S. Som, X. Song, and F. Wei. Language is not all you need: Aligning perception with language models. InNeurIPS, 2023

2023
[40]

M. Huh, B. Cheung, T. Wang, and P. Isola. The platonic representation hypothesis. InICML, 2024

2024
[41]

P. Isola. Personal communication, 2025

2025
[42]

R. Jha, C. Zhang, V . Shmatikov, and J. X. Morris. Harnessing the universal geometry of embeddings. InNeurIPS, 2025

2025
[43]

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Tracing representation progression: Analyzing and enhancing layer-wise similarity.arXiv preprint arXiv:2406.14479, 2024

J. Jiang, J. Zhou, and Z. Zhu. Tracing representation progression: Analyzing and enhancing layer-wise similarity.arXiv preprint arXiv:2406.14479, 2024

work page arXiv 2024
[46]

J. J. Koenderink.Sentience. De Clootcrans Press, Trajectum, Netherlands, 2019

2019
[47]

Kornblith, M

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. InICML, 2019

2019
[48]

Kriegeskorte, M

N. Kriegeskorte, M. Mur, and P. A. Bandettini. Representational similarity analysis-connecting the branches of systems neuroscience.Frontiers in systems neuroscience, 2008

2008
[49]

Krishna, Y

R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.IJCV, 2017

2017
[50]

Kumar, J

A. Kumar, J. Clune, J. Lehman, and K. O. Stanley. Questioning representational opti- mism in deep learning: The fractured entangled representation hypothesis.arXiv preprint arXiv:2505.11581, 2025

work page arXiv 2025
[51]

LeCun et al

Y . LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Openreview, 2022. 14

2022
[52]

Lenc and A

K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. InCVPR, 2015

2015
[53]

Y . Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft. Convergent learning: Do different neural networks learn the same representations? InICLR, 2016

2016
[54]

Liang, W

J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. InICRA, 2023

2023
[55]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. InECCV, 2014

2014
[56]

A. H. Liu, S. Subramanian, V . Jouault, A. Sadé, et al. Ministral 3.arXiv preprint arXiv:2601.08584, 2026

work page internal anchor Pith review arXiv 2026
[57]

D. Liu, S. Zhao, L. Zhuo, W. Lin, Y . Xin, X. Li, Q. Qin, Y . Qiao, H. Li, and P. Gao. Lumina- mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2025

work page arXiv 2025
[58]

H. Liu, J. Liu, L. Cui, Z. Teng, N. Duan, M. Zhou, and Y . Zhang. Logiqa2.0: The logicqa dataset for logical reasoning.IEEE Transactions on Audio, Speech, and Language Processing, 2023

2023
[59]

Maniparambil, R

M. Maniparambil, R. Akshulakov, Y . A. D. Djilali, S. Narayan, M. E. A. Seddik, K. Mangalam, and N. E. O’Connor. Do vision and language encoders represent the world similarly? InCVPR, 2024

2024
[60]

Marcos-Manchón and L

P. Marcos-Manchón and L. Fuentemilla. Shared representations in brains and models reveal a two-route cortical organization during scene perception.arXiv preprint arXiv:2507.13941, 2026

work page internal anchor Pith review arXiv 2026
[61]

Pointer Sentinel Mixture Models

S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016

work page internal anchor Pith review arXiv 2016
[62]

Merullo, L

J. Merullo, L. Castricato, C. Eickhoff, and E. Pavlick. Linearly mapping from image to text space. InICLR, 2023

2023
[63]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation,

Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation,
[64]

URLhttps://ai.meta.com/blog/llama-4-multimodal-intelligence/
[65]

A. S. Morcos, M. Raghu, and S. Bengio. Insights on representational similarity in neural networks with canonical correlation. InNeurIPS, 2018

2018
[66]

Moschella, V

L. Moschella, V . Maiorca, M. Fumero, A. Norelli, F. Locatello, and E. Rodolà. Relative representations enable zero-shot latent space communication. InICLR, 2023

2023
[67]

Muennighoff, T

N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, and C. Raffel. Crosslingual generalization through multitask finetuning. In ACL, 2023

2023
[68]

T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y . Gu, S. Huang, M. Jordan, N. Lambert, D. Schwenk, O. Tafjord, T. Anderson, D. Atkinson, F. Brahman, C. Clark, P. Dasigi, N. Dziri, A. Ettinger, M. Guerquin, D. Heineman, H. Ivison, P. W. Koh, J. Liu, S. Malik, W. Merrill, L. J. V . Miranda, J. Morrison, T. Murray, C. Nam, J. Poz...

work page internal anchor Pith review arXiv 2025
[69]

Introducing gpt-oss, 2025

OpenAI. Introducing gpt-oss, 2025. URL https://openai.com/index/ introducing-gpt-oss/

2025
[70]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without super...

2024
[71]

Pichai, D

S. Pichai, D. Hassabis, and K. Kavukcuoglu. A new era of intelligence with Gem- ini 3. Google Blog (The Keyword), Nov. 2025. URL https://blog.google/ products-and-platforms/products/gemini/gemini-3/. Accessed: 2026-01-01

2025
[72]

Republic

Plato. Republic. c. 375 BC
[73]

A. C. Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[74]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InICML, 2021

2021
[75]

Research

I. Research. Granite 3.3 8b base, 2025. URL https://huggingface.co/ibm-granite/ granite-3.3-8b-base

2025
[76]

E. Rosch. Principles of categorization. In E. Rosch and B. B. Lloyd, editors,Cognition and Categorization, pages 27–48. Lawrence Elbaum Associates, 1978

1978
[77]

J. Ruan, A. Abudula, X. Liu, B. Li, Y . Li, C. Wang, Y . Fan, Y . Ge, T. Xiao, and J. Zhu. Ndp: Next distribution prediction as a more broad target.arXiv preprint arXiv:2408.17377, 2024

work page arXiv 2024
[78]

Schnaus, N

D. Schnaus, N. Araslanov, and D. Cremers. It’s a (blind) match! towards vision-language correspondence without parallel data. InCVPR, 2025

2025
[79]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs.arXiv preprint arXiv:2111.02114, 2021

work page internal anchor Pith review arXiv 2021
[80]

Singh, R

A. Singh, R. Hu, V . Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela. Public multimodal dataset (PMD). URLhttps://huggingface.co/datasets/facebook/pmd

Showing first 80 references.