Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Been Kim; Lauren Hyoseo Yoon; Yisong Yue

arxiv: 2507.01201 · v7 · pith:DOMGPEW2new · submitted 2025-07-01 · 💻 cs.LG · cs.CV

Lauren Hyoseo Yoon, Yisong Yue, Been Kim

Pith reviewed 2026-05-21 23:52 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords vision-language alignment · multimodal models · autoencoders · representation alignment · joint training · Platonic Representation Hypothesis · fine-grained distinctions

The pith

Joint autoencoders align independently trained vision and language models by coordinating reconstruction and cross-modal objectives on frozen backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Independently trained vision and language models occupy separate representational spaces, yet the Platonic Representation Hypothesis suggests they may still converge on a shared statistical model of reality. The paper introduces the Joint Autoencoder Modulator (JAM) to explicitly induce alignment by placing modality-specific autoencoders on top of frozen unimodal models and training them with both within-modality reconstruction and cross-modal alignment losses. The approach targets fine-grained contextual distinctions where global meaning is shared but compositional details differ, and it does so without paired multimodal training data. Systematic tests vary the alignment objective (including a new multimodal Spread Loss), the layer at which alignment occurs, and the scale of the underlying foundation models. The central result is that JAM produces reliable alignment even when the original models were trained completely separately.

Core claim

The Joint Autoencoder Modulator (JAM) reliably induces alignment between independently trained vision and language representations by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives, including a multimodal Spread Loss that outperforms classic contrastive methods; this holds across choices of layer depth and foundation-model scale.

What carries the argument

Joint Autoencoder Modulator (JAM): modality-specific autoencoders placed atop frozen unimodal models and trained jointly with reconstruction losses inside each modality plus cross-modal alignment losses.
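
The structure described above can be sketched as a single combined loss. This is a minimal illustration, not the paper's implementation: the linear encoder and decoder, the cosine-distance alignment term, the weight `lam`, and every name in the block (`LinearAutoencoder`, `jam_style_loss`, `h_v`, `h_l`) are assumptions introduced here. The frozen backbones are represented only by their fixed output features.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b + 1e-12)


class LinearAutoencoder:
    """Hypothetical stand-in for one modality-specific autoencoder."""

    def __init__(self, in_dim, latent_dim, rng):
        self.enc = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(latent_dim)]
        self.dec = [[rng.gauss(0.0, 0.1) for _ in range(latent_dim)] for _ in range(in_dim)]

    def encode(self, h):
        return matvec(self.enc, h)

    def decode(self, z):
        return matvec(self.dec, z)


def jam_style_loss(ae_v, ae_l, h_v, h_l, lam=1.0):
    """Within-modality reconstruction plus a cross-modal alignment term.

    h_v and h_l are features from the frozen vision and language models;
    in training, only the autoencoder weights would be updated.
    """
    z_v, z_l = ae_v.encode(h_v), ae_l.encode(h_l)
    recon_v = mse(ae_v.decode(z_v), h_v)   # vision reconstruction
    recon_l = mse(ae_l.decode(z_l), h_l)   # language reconstruction
    align = 1.0 - cosine(z_v, z_l)         # placeholder alignment term (assumption)
    total = recon_v + recon_l + lam * align
    return total, (recon_v, recon_l, align)


rng = random.Random(0)
ae_v = LinearAutoencoder(in_dim=8, latent_dim=4, rng=rng)
ae_l = LinearAutoencoder(in_dim=6, latent_dim=4, rng=rng)
h_v = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for a frozen vision feature
h_l = [rng.gauss(0.0, 1.0) for _ in range(6)]  # stand-in for a frozen language feature
total, (recon_v, recon_l, align) = jam_style_loss(ae_v, ae_l, h_v, h_l, lam=0.5)
print(f"recon_v={recon_v:.3f} recon_l={recon_l:.3f} align={align:.3f} total={total:.3f}")
```

The two reconstruction terms each depend on one modality only, while the alignment term needs a latent from both, which is why that term requires some correspondence between the image and the text being compared.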

If this is right

JAM enables conversion of generalist unimodal models into specialist multimodal models while preserving original unimodal performance.
A multimodal Spread Loss outperforms standard contrastive objectives for aligning fine-grained contextual distinctions.
Alignment is most effective at particular layer depths and improves with larger foundation model scale.
Shared semantics can be actively optimized rather than merely observed after the fact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same autoencoder modulator pattern could be tested on modality pairs beyond vision and language.
If alignment emerges without paired data, the method may reduce reliance on expensive multimodal corpora for downstream tasks.
Layer-depth and scale findings suggest concrete starting points for practitioners choosing where to attach such modulators.

Load-bearing premise

Coordinated reconstruction and cross-modal alignment objectives applied to modality-specific autoencoders on top of frozen models will produce useful alignment without requiring paired multimodal training data or degrading the original unimodal capabilities.

What would settle it

A controlled experiment in which JAM is applied to a pair of independently trained models and cross-modal retrieval or generation accuracy on fine-grained distinction tasks shows no improvement over the unaligned frozen baselines.
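
Such an experiment reduces to comparing one retrieval number before and after alignment. The sketch below computes Recall@1 for image-to-text retrieval over a small candidate set; it assumes image and caption embeddings already live in a space where cosine similarity is meaningful (for an unaligned baseline that itself requires some projection), and that index 0 of each candidate list holds the matching caption. The function name and toy data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b + 1e-12)


def recall_at_1(image_embs, caption_sets):
    """Fraction of images whose top-scoring candidate is the true caption.

    caption_sets[i] lists candidate caption embeddings for image i;
    by convention here (an assumption), index 0 is the match and the
    remaining entries are non-matching captions.
    """
    hits = 0
    for img, candidates in zip(image_embs, caption_sets):
        scores = [cosine(img, c) for c in candidates]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == 0)
    return hits / len(image_embs)


# Toy data: the first image retrieves its match, the second is fooled by a negative.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.1, 0.9]],
]
score = recall_at_1(images, captions)
print(score)  # 0.5
```

Running the same function on baseline embeddings and on JAM-modulated embeddings, over the same candidate sets, gives the two numbers the test calls for.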

Figures

Figures reproduced from arXiv: 2507.01201 by Been Kim, Lauren Hyoseo Yoon, Yisong Yue.

**Figure 1.** Illustration of fine-grained contextual understanding from the SugarCrepe dataset [20]. Each image is paired with three types of captions: (i) Match (true positive) captions that correctly describe the image, (ii) Easy non-match captions that are entirely unrelated, and (iii) Hard non-match (hard negative) captions that share global semantics with the true caption but diverge in subtle, fine-grained detail…

**Figure 2.** Statistical Metrics for Representation Alignment: Across all metrics and model cases, match pairs consistently show higher alignment scores than easy non-match pairs, supporting the hypothesis that unimodal models encode shared global structure. However, hard non-match pairs exhibit similarly high scores. This indicates that while statistical metrics for representations reveal coarse representational compa…

**Figure 3.** Joint Autoencoder Modulator (JAM) framework.

**Figure 4.** Illustration of the Objective of Spread Loss Formulation: Blue and Pink circles correspond to a similar-context group; Green circles are representations outside of the similar context. Figure inspired by [31]. (Body text captured with the caption: To address this, we introduce L_spread, a contrastive objective that incorporates a notion of context similarity and fine-grained differentiation. As shown in …)

**Figure 5.** α supervision with respect to the extracted embedding layers to achieve the best retrieval accuracy for each task. |N_L| and |N_V| refer to the total number of layers of the pretrained language and vision models, respectively; n_L and n_V refer to the layer depths used for the Early, Mid, and Late experiments.

**Figure 6.** Image-to-Text Retrieval Recall@1 (5 options case) achieved through the α in …
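
The Figure 4 caption describes a loss with two jobs: keep a context group together relative to everything else, and still separate the true caption from its near neighbours inside the group. The exact formulation is not given on this page, so the block below is only one plausible shape, modelled on the supervised-contrastive mixture in [31]: a group-level term and an instance-level term blended by a weight `alpha`. The function name, the argument layout, the temperature `tau`, and the role given to `alpha` are all assumptions.

```python
import math


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x))) over a list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def spread_style_loss(scores, group, match, alpha=0.5, tau=0.1):
    """Blend a group-level and an instance-level contrastive term.

    scores: similarity of one image anchor to every caption in the batch.
    group:  indices of captions in the anchor's context group
            (the match plus its hard negatives).
    match:  index of the true caption; must be a member of group.
    """
    logits = [s / tau for s in scores]
    lse_all = log_sum_exp(logits)
    lse_group = log_sum_exp([logits[j] for j in group])
    # Group-level term: each in-group caption against the whole batch.
    group_term = -sum(logits[j] - lse_all for j in group) / len(group)
    # Instance-level term: the match against the other members of its group.
    instance_term = -(logits[match] - lse_group)
    return (1.0 - alpha) * group_term + alpha * instance_term


group = [0, 1, 2]                    # caption 3 is an easy non-match
scores_good = [0.9, 0.5, 0.5, 0.0]   # true caption (index 0) scores highest
scores_bad = [0.5, 0.9, 0.5, 0.0]    # a hard negative outranks the true caption
loss_good = spread_style_loss(scores_good, group, match=0, alpha=1.0)
loss_bad = spread_style_loss(scores_bad, group, match=0, alpha=1.0)
print(loss_good < loss_bad)
```

With `alpha=0` the two toy cases score the same, because the group-level term cannot tell which in-group caption is the true one; the instance-level term is what penalizes a hard negative outranking the match.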

Abstract

Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions, where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

JAM adds autoencoders and a spread loss to pull frozen vision and language models into better alignment, but the cross-modal step still needs paired image-text data.

The main point is that this paper gives a concrete recipe for aligning separately trained vision and language models without touching their weights. You stack small autoencoders on each frozen model, train them with reconstruction losses inside each modality plus cross-modal terms, and introduce a multimodal Spread Loss that the authors claim handles fine-grained cases better than plain contrastive losses. They also check which layers work best for the alignment and how much natural overlap already exists as model scale grows, tying back to the Platonic Representation Hypothesis.

Keeping the bases frozen is the practical hook, and the three-axis evaluation gives some usable implementation notes. The spread loss and the layer/scale checks are the clearest additions over prior contrastive or autoencoder alignment work.

The experiments appear to use standard paired sets like COCO, which lets them run the cross-modal objectives but directly undercuts any claim that the whole process runs on purely unpaired unimodal corpora. Reconstruction is modality-internal, yet any term that moves one embedding toward the other still needs explicit correspondences. That is the main soft spot: the practical story is narrower than the abstract framing suggests, and downstream task gains or preservation of original unimodal performance would need stronger numbers to land.

The approach itself is coherent and the evaluations are systematic enough to be worth referee time. This is for people working on efficient multimodal adaptation or representation alignment who already have access to some paired data. A reader focused on loss design or post-hoc fusion would get value from the comparisons. It deserves a serious referee.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Joint Autoencoder Modulator (JAM) to explicitly optimize alignment between independently trained frozen vision and language models. JAM trains modality-specific autoencoders using coordinated reconstruction losses together with cross-modal alignment objectives (including a new multimodal Spread Loss), and evaluates the method along three axes: choice of alignment objective, layer depth for alignment, and foundation-model scale. The central claim is that this procedure reliably induces useful alignment even across disjoint representational spaces, yielding both theoretical insight into shared semantics and practical guidance for building specialist multimodal models from generalist unimodal foundations.

Significance. If the empirical claims hold, the work would offer a post-hoc route to multimodal capability that avoids full retraining of large foundation models, potentially lowering compute barriers. The Spread Loss and the systematic study of layer depth and scale would constitute concrete technical contributions to the literature on representation alignment and the Platonic Representation Hypothesis.

major comments (2)

[Introduction and §3 (Method)] The description of cross-modal objectives (contrastive loss and the proposed Spread Loss) presupposes explicit image-text correspondences to form positive/negative pairs or regression targets. Yet the introduction and abstract frame JAM as operating on independently trained unimodal models without requiring paired multimodal data for the alignment stage. Because reconstruction losses are unimodal while alignment losses are not, the practical claim that JAM can be applied in purely unpaired settings is not supported by the stated objectives; experiments on standard paired corpora (COCO, Flickr30k) further indicate that paired data is consumed.
[§4 (Experiments)] The central claim that JAM 'reliably induces alignment' and outperforms contrastive baselines rests on quantitative results that are not previewed with concrete metrics, error bars, dataset sizes, or ablation tables in the abstract or summary. Without these details it is impossible to judge whether the reported alignment is statistically meaningful or merely reflects the capacity of the added autoencoders rather than genuine cross-modal semantic convergence.
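
The first major comment turns on how a contrastive objective defines its positives. A minimal sketch of a symmetric InfoNCE loss makes the dependence visible: the positive for image `i` is caption `i`, so the loss is only meaningful when the batch is built from known image-text pairs. The function names and the temperature value are assumptions for illustration, not the paper's code.

```python
import math


def log_softmax_row(row):
    """Log-softmax of a single row, computed stably."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(sim, tau=0.07):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between image i and caption j.
    The diagonal entry (i, i) is treated as the positive pair, which
    presupposes that image i and caption i are a known match.
    """
    n = len(sim)
    logits = [[s / tau for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


paired = [[1.0, 0.0], [0.0, 1.0]]    # captions listed in the same order as images
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # same captions, correspondence lost
loss_paired = symmetric_info_nce(paired)
loss_shuffled = symmetric_info_nce(shuffled)
print(loss_paired < loss_shuffled)
```

The same embeddings yield a near-zero loss when the pairing is known and a large one when it is scrambled, so without correspondences the objective has no usable training signal.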

minor comments (2)

[Abstract] The three design axes are listed but the key quantitative outcomes (e.g., retrieval accuracy deltas, Spread Loss vs. contrastive margins) are not summarized, reducing the abstract's utility as a standalone overview.
[§3] Notation: Define the embedding spaces of the vision and language autoencoders with consistent symbols (e.g., z_v, z_l) before the first equation in §3; current usage appears to switch between 'latent' and 'modulated' without explicit mapping.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our work. We respond to each major comment below and indicate planned revisions.

Referee: [Introduction and §3 (Method)] The description of cross-modal objectives (contrastive loss and the proposed Spread Loss) presupposes explicit image-text correspondences to form positive/negative pairs or regression targets. Yet the introduction and abstract frame JAM as operating on independently trained unimodal models without requiring paired multimodal data for the alignment stage. Because reconstruction losses are unimodal while alignment losses are not, the practical claim that JAM can be applied in purely unpaired settings is not supported by the stated objectives; experiments on standard paired corpora (COCO, Flickr30k) further indicate that paired data is consumed.

Authors: We agree that the cross-modal objectives (contrastive loss and Spread Loss) require paired image-text data to define positives, negatives, or regression targets, while the unimodal reconstruction losses do not. The manuscript's framing that JAM operates 'without requiring paired multimodal data' is imprecise. The vision and language models are independently trained and remain frozen, but the JAM alignment stage consumes paired data from standard corpora. We will revise the abstract and Introduction to explicitly distinguish these points: JAM aligns frozen independently-trained models using paired data for cross-modal objectives, without retraining the foundation models. This addresses the inconsistency. Revision: yes.

Referee: [§4 (Experiments)] The central claim that JAM 'reliably induces alignment' and outperforms contrastive baselines rests on quantitative results that are not previewed with concrete metrics, error bars, dataset sizes, or ablation tables in the abstract or summary. Without these details it is impossible to judge whether the reported alignment is statistically meaningful or merely reflects the capacity of the added autoencoders rather than genuine cross-modal semantic convergence.

Authors: Section 4 contains the full quantitative results, including specific metrics (e.g., alignment scores, retrieval accuracies), error bars, dataset sizes (COCO ~113k images, Flickr30k), and ablation tables comparing objectives, layers, and scales. To make these claims more immediately evaluable, we will revise the abstract to preview key numerical findings, such as the performance gains of the Spread Loss over contrastive baselines and the scale of the experiments. This will help distinguish genuine cross-modal convergence from autoencoder capacity effects. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity in JAM derivation chain

The paper presents JAM as an empirical training procedure that applies coordinated reconstruction losses (computable separately per modality) plus cross-modal alignment objectives to modality-specific autoencoders atop frozen unimodal models. Claims rest on systematic experimental evaluations across alignment objectives, layer depth, and model scale rather than any closed-form derivation or first-principles result that reduces to its own inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The approach is self-contained against external benchmarks via reported performance on standard corpora.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available, so ledger is incomplete. The Platonic Representation Hypothesis is invoked as background motivation but treated as a hypothesis rather than a proven axiom.

axioms (1)

Domain assumption: Independently trained vision and language models can be aligned via joint autoencoder training with coordinated objectives.
Central premise of the JAM construction stated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives... multimodal Spread Loss that outperforms classic contrastive methods

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · echoes

Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 3 internal anchors

[1]

The platonic representation hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. Proceedings of the 41st International Conference of Machine Learning, 2024

work page 2024
[2]

Republic (De Republica)

Plato. Republic (De Republica). 375 BC

work page
[3]

Revisiting model stitching to compare neural representations

Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. Advances in neural information processing systems, 2021

work page 2021
[4]

Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning

Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Pro- ceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, page 2443–2449, New York, NY , USA, 2021. Association for ...

work page 2021
[5]

Kornblith, M

S. Kornblith, M. Norouzi, H. Lee, and G Hinton. Similarity of neural network representations revisited. Proceedings of the 36th International Conference on Machine Learning , page 3519–3529, 2019

work page 2019
[6]

Raghu, J

M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 2017

work page 2017
[7]

Insights on representational similarity in neural networks with canonical correlation

Ari S Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems, volume 31, 2018

work page 2018
[8]

Similarity of neural network models: A survey of functional and representational measures

Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. Similarity of neural network models: A survey of functional and representational measures. ACM Comput. Surv., 57(9), May 2025

work page 2025
[9]

Visiolinguistic attention learning for multimodal coreference resolution

Mahmoud Azab, Xuwang Lyu, Lane Schwartz, and Jeffrey Allen. Visiolinguistic attention learning for multimodal coreference resolution. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1990–2000, 2019

work page 2019
[10]

Language Is Not All You Need: Aligning Perception with Language Models

Xiaodong Liu, Yujie Wang, Yichong Xu, Yuwei Chen, et al. Hidden talents of multi- modal models: Can pretrained multimodal models help monomodal tasks? arXiv preprint arXiv:2302.14045, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Understanding image representations by measuring their equivariance and equivalence, 2015

Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence, 2015. 20

work page 2015
[12]

Gemini: a family of highly capable multimodal models, 2023

Google. Gemini: a family of highly capable multimodal models, 2023

work page 2023
[13]

Gpt-4 with vision

OpenAI. Gpt-4 with vision. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023

work page 2023
[14]

Llama 3 model card

AI@Meta. Llama 3 model card. 2024

work page 2024
[15]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Interna- tional Conference on Machine Learning, 2021

work page 2021
[16]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

work page 2021
[17]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022

work page 2022
[18]

BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

work page 2023
[19]

Deepseek-vl: Scaling vision-language with decoupled multimodal pretraining

Can Xu, Qiaolin Zeng, Yichong Wu, Yifan Zhang, Qian Li, Wei Wei, et al. Deepseek-vl: Scaling vision-language with decoupled multimodal pretraining. arXiv preprint arXiv:2403.09696, 2024

work page arXiv 2024
[20]

Sugar- crepe: Fixing hackable benchmarks for vision-language compositionality

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugar- crepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

work page 2023
[21]

Gemma Team. Gemma. 2024

work page 2024
[22]

2 olmo 2 furious, 2024

Team OLMo. 2 olmo 2 furious, 2024

work page 2024
[23]

Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba...

work page 2023
[24]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[25]

What regularized auto-encoders learn from the data- generating distribution

Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data- generating distribution. J. Mach. Learn. Res., 15(1):3563–3593, January 2014

work page 2014
[26]

Regularized linear autoen- coders recover the principal components, eventually

Xuchan Bao, James Lucas, Sushant Sachdeva, and Roger Grosse. Regularized linear autoen- coders recover the principal components, eventually. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, 2020

work page 2020
[27]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016

work page 2016
[28]

Glu variants improve transformer, 2020

Noam Shazeer. Glu variants improve transformer, 2020

work page 2020
[29]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023

work page 2023
[30]

When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations, 2023

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations, 2023. 21

work page 2023
[31]

Chen, Daniel Y

Mayee F. Chen, Daniel Y . Fu, Avanika Narayan, Michael Zhang, Zhao Song, Kayvon Fatahalian, and Christopher Ré. Perfectly balanced: Improving transfer and robustness of supervised contrastive learning. 2022

work page 2022
[32]

Fu, Mayee F

Daniel Y . Fu, Mayee F. Chen, Michael Zhang, Kayvon Fatahalian, and Christopher Ré. The details matter: Preventing class collapse in supervised contrastive learning. 2022

work page 2022
[33]

Supervised contrastive learning

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020

work page arXiv 2004
[34]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022

work page 2022
[35]

Masked Autoencoders Are Scalable Vision Learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[36]

Unsupervised learning of visual features by contrasting cluster assignments

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. 2020

work page 2020
[37]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

work page 2021
[38]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021

work page 2021
[39]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

work page 2019
[40]

Openclip, July 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021

work page 2021
[41]

LAION-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmar- czyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text m...

work page 2022
[42]

Sigmoid Loss for Language Image Pre-Training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Understanding dimensional collapse in contrastive self-supervised learning, 2022

Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning, 2022

work page 2022
[44]

Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, page 41–48, 2009

work page 2009
[45]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav), 2018

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav), 2018

work page 2018
[46]

Network dissection: Quantifying interpretability of deep visual representations

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017. 22

work page 2017
[47]

Foundation models for time series analysis: A tutorial and survey

Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6555–6565, 2024

work page 2024
[48]

Totem: Tokenized time series embeddings for general time series analysis

Sabera J Talukder, Yisong Yue, and Georgia Gkioxari. Totem: Tokenized time series embeddings for general time series analysis. Transactions on Machine Learning Research

work page
[49]

Moment: A family of open time-series foundation models

Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024

work page arXiv 2024
[50]

A decoder-only foundation model for time-series forecasting

Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024

[51] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3-4):321–377, 1936.

[52] Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, and Mark Crowley. Reproducing kernel Hilbert space, Mercer's theorem, eigenfunctions, Nyström method, and use of kernels in machine learning: Tutorial and survey, 2021.

[53] Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexander J. Smola. A kernel statistical test of independence. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS'07, page 585–592, 2007.

[54] Anna Bykhovskaya and Vadim Gorin. High-dimensional canonical correlation analysis, 2025.


[1] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. Proceedings of the 41st International Conference on Machine Learning, 2024.

[2] Plato. Republic (De Republica). 375 BC.

[3] Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. Advances in Neural Information Processing Systems, 2021.

[4] Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, page 2443–2449, New York, NY, USA, 2021. Association for ...

[5] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. Proceedings of the 36th International Conference on Machine Learning, page 3519–3529, 2019.

[6] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems, 2017.

[7] Ari S Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems, volume 31, 2018.

[8] Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. Similarity of neural network models: A survey of functional and representational measures. ACM Comput. Surv., 57(9), May 2025.

[9] Mahmoud Azab, Xuwang Lyu, Lane Schwartz, and Jeffrey Allen. Visiolinguistic attention learning for multimodal coreference resolution. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1990–2000, 2019.

[10] Xiaodong Liu, Yujie Wang, Yichong Xu, Yuwei Chen, et al. Hidden talents of multimodal models: Can pretrained multimodal models help monomodal tasks? arXiv preprint arXiv:2302.14045, 2023.

[11] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence, 2015.

[12] Google. Gemini: A family of highly capable multimodal models, 2023.

[13] OpenAI. GPT-4 with vision. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.

[14] AI@Meta. Llama 3 model card. 2024.

[15] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.

[16] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.

[17] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.

[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.

[19] Can Xu, Qiaolin Zeng, Yichong Wu, Yifan Zhang, Qian Li, Wei Wei, et al. DeepSeek-VL: Scaling vision-language with decoupled multimodal pretraining. arXiv preprint arXiv:2403.09696, 2024.

[20] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.

[21] Gemma Team. Gemma. 2024.

[22] Team OLMo. 2 OLMo 2 Furious, 2024.

[23] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba...

[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[25] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res., 15(1):3563–3593, January 2014.

[26] Xuchan Bao, James Lucas, Sushant Sachdeva, and Roger Grosse. Regularized linear autoencoders recover the principal components, eventually. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, 2020.

[27] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.

[28] Noam Shazeer. GLU variants improve transformer, 2020.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

[30] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations, 2023.

[31] Mayee F. Chen, Daniel Y. Fu, Avanika Narayan, Michael Zhang, Zhao Song, Kayvon Fatahalian, and Christopher Ré. Perfectly balanced: Improving transfer and robustness of supervised contrastive learning. 2022.

[32] Daniel Y. Fu, Mayee F. Chen, Michael Zhang, Kayvon Fatahalian, and Christopher Ré. The details matter: Preventing class collapse in supervised contrastive learning. 2022.

[33] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.

[34] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022.

[35] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.

[36] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. 2020.

[37] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

[38] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.

[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[40] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021.

[41] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text m...

[42] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023.

[43] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning, 2022.
