Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
Pith reviewed 2026-05-21 23:52 UTC · model grok-4.3
The pith
Joint autoencoders align independently trained vision and language models by coordinating reconstruction and cross-modal objectives on frozen backbones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Joint Autoencoder Modulator (JAM) reliably induces alignment between independently trained vision and language representations by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives, including a multimodal Spread Loss that outperforms classic contrastive methods; this holds across choices of layer depth and foundation-model scale.
What carries the argument
Joint Autoencoder Modulator (JAM): modality-specific autoencoders placed atop frozen unimodal models and trained jointly with reconstruction losses inside each modality plus cross-modal alignment losses.
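To make the shape of this objective concrete, here is a minimal, dependency-free sketch of how the loss terms might be combined for one paired example. It is not the paper's implementation. The linear encoders and decoders, the sizes `d_v`, `d_l`, `d_z`, the `align_weight` parameter, and the function name `jam_style_loss` are all hypothetical. The alignment term is a simple placeholder (one minus cosine similarity), since this page does not define the Spread Loss.

```python
import random

def linear(weights, x):
    """Bias-free dense layer: returns the matrix-vector product weights @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def init_matrix(rows, cols, rng):
    """Small random weight matrix with shape (rows, cols)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def cosine(a, b):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = sum(ai * ai for ai in a) ** 0.5
    norm_b = sum(bi * bi for bi in b) ** 0.5
    return dot / (norm_a * norm_b + 1e-12)

def jam_style_loss(h_v, h_l, params, align_weight=1.0):
    """Loss terms for one paired example of frozen-backbone features.

    h_v, h_l: feature vectors taken from the frozen vision / language models.
    The alignment term is a placeholder (1 - cosine), NOT the Spread Loss.
    """
    z_v = linear(params["enc_v"], h_v)              # vision latent
    z_l = linear(params["enc_l"], h_l)              # language latent
    rec_v = mse(linear(params["dec_v"], z_v), h_v)  # unimodal reconstruction
    rec_l = mse(linear(params["dec_l"], z_l), h_l)  # unimodal reconstruction
    align = 1.0 - cosine(z_v, z_l)                  # cross-modal term
    total = rec_v + rec_l + align_weight * align
    return {"rec_v": rec_v, "rec_l": rec_l, "align": align, "total": total}

rng = random.Random(0)
d_v, d_l, d_z = 6, 4, 3  # hypothetical feature and latent sizes
params = {
    "enc_v": init_matrix(d_z, d_v, rng), "dec_v": init_matrix(d_v, d_z, rng),
    "enc_l": init_matrix(d_z, d_l, rng), "dec_l": init_matrix(d_l, d_z, rng),
}
h_v = [rng.gauss(0.0, 1.0) for _ in range(d_v)]  # stand-in for frozen vision features
h_l = [rng.gauss(0.0, 1.0) for _ in range(d_l)]  # stand-in for frozen language features
losses = jam_style_loss(h_v, h_l, params)
print(losses)
```

The sketch only evaluates the loss. In a training loop, gradients would flow into the four autoencoder matrices while the backbones that produced `h_v` and `h_l` stay untouched, which is what "frozen" means here.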
If this is right
- JAM enables conversion of generalist unimodal models into specialist multimodal models while preserving original unimodal performance.
- A multimodal Spread Loss outperforms standard contrastive objectives for aligning fine-grained contextual distinctions.
- Alignment is most effective at particular layer depths and improves with larger foundation model scale.
- Shared semantics can be actively optimized rather than merely observed after the fact.
Where Pith is reading between the lines
- The same autoencoder modulator pattern could be tested on modality pairs beyond vision and language.
- If alignment emerges without paired data, the method may reduce reliance on expensive multimodal corpora for downstream tasks.
- Layer-depth and scale findings suggest concrete starting points for practitioners choosing where to attach such modulators.
Load-bearing premise
Coordinated reconstruction and cross-modal alignment objectives applied to modality-specific autoencoders on top of frozen models will produce useful alignment without requiring paired multimodal training data or degrading the original unimodal capabilities.
What would settle it
A controlled experiment in which JAM is applied to a pair of independently trained models and cross-modal retrieval or generation accuracy on fine-grained distinction tasks shows no improvement over the unaligned frozen baselines.
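One way such a comparison could be scored is recall@k on a query-by-candidate similarity matrix, computed once for the unaligned baseline and once after alignment. The sketch below assumes that candidate `i` is the correct match for query `i`. The two matrices are made-up toy scores for illustration only, not results from the paper, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match is among the top-k candidates.

    similarity[i][j] is the score of query i against candidate j;
    candidate i is assumed to be the correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy scores only: rows are image queries, columns are caption candidates.
baseline = [
    [0.2, 0.9, 0.1],
    [0.8, 0.3, 0.4],
    [0.1, 0.2, 0.7],
]
aligned = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.1, 0.2, 0.7],
]

print("baseline R@1:", recall_at_k(baseline, 1))
print("aligned  R@1:", recall_at_k(aligned, 1))
```

If the aligned score is no higher than the baseline on fine-grained distinction tasks, that is the negative outcome described above.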
Original abstract
Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions, where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Joint Autoencoder Modulator (JAM) to explicitly optimize alignment between independently trained frozen vision and language models. JAM trains modality-specific autoencoders using coordinated reconstruction losses together with cross-modal alignment objectives (including a new multimodal Spread Loss), and evaluates the method along three axes: choice of alignment objective, layer depth for alignment, and foundation-model scale. The central claim is that this procedure reliably induces useful alignment even across disjoint representational spaces, yielding both theoretical insight into shared semantics and practical guidance for building specialist multimodal models from generalist unimodal foundations.
Significance. If the empirical claims hold, the work would offer a post-hoc route to multimodal capability that avoids full retraining of large foundation models, potentially lowering compute barriers. The Spread Loss and the systematic study of layer depth and scale would constitute concrete technical contributions to the literature on representation alignment and the Platonic Representation Hypothesis.
major comments (2)
- [Introduction and §3] Introduction and §3 (Method): The description of cross-modal objectives (contrastive loss and the proposed Spread Loss) presupposes explicit image-text correspondences to form positive/negative pairs or regression targets. Yet the introduction and abstract frame JAM as operating on independently trained unimodal models without requiring paired multimodal data for the alignment stage. Because reconstruction losses are unimodal while alignment losses are not, the practical claim that JAM can be applied in purely unpaired settings is not supported by the stated objectives; experiments on standard paired corpora (COCO, Flickr30k) further indicate that paired data is consumed.
- [§4] §4 (Experiments): The central claim that JAM 'reliably induces alignment' and outperforms contrastive baselines rests on quantitative results that are not previewed with concrete metrics, error bars, dataset sizes, or ablation tables in the abstract or summary. Without these details it is impossible to judge whether the reported alignment is statistically meaningful or merely reflects the capacity of the added autoencoders rather than genuine cross-modal semantic convergence.
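The first major comment can be illustrated with a standard CLIP-style symmetric contrastive loss, used here as a stand-in because the Spread Loss is not defined on this page. The positives are read off the diagonal of the similarity matrix, so the loss is only meaningful when row `i` of the vision latents and row `i` of the language latents come from the same image-text pair. The function names, the temperature value, and the toy vectors are assumptions for this sketch.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row i)[i]: the diagonal entry is the positive."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(z_v, z_l, temperature=0.1):
    """CLIP-style loss; assumes z_v[i] and z_l[i] describe the same item."""
    logits = [[dot(v, t) / temperature for t in z_l] for v in z_v]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

# Toy unit-norm latents: two images and their two captions.
z_v = [[1.0, 0.0], [0.0, 1.0]]
z_l_paired = [[1.0, 0.0], [0.0, 1.0]]    # caption i matches image i
z_l_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # same captions, pairing destroyed

paired_loss = symmetric_contrastive_loss(z_v, z_l_paired)
shuffled_loss = symmetric_contrastive_loss(z_v, z_l_shuffled)
print(paired_loss, shuffled_loss)
```

The same latents give a near-zero loss when the pairing is intact and a large loss when it is shuffled, which is why an objective of this form cannot be computed from unpaired data.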
minor comments (2)
- [Abstract] Abstract: The three design axes are listed but the key quantitative outcomes (e.g., retrieval accuracy deltas, Spread Loss vs. contrastive margins) are not summarized, reducing the abstract's utility as a standalone overview.
- [§3] Notation: Define the embedding spaces of the vision and language autoencoders with consistent symbols (e.g., z_v, z_l) before the first equation in §3; current usage appears to switch between 'latent' and 'modulated' without explicit mapping.
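A consistent mapping of the kind requested in the notation comment might look like the following. Only z_v and z_l come from the referee's suggestion. The symbols h_v and h_l (frozen backbone features), E and D (encoder and decoder of each autoencoder), and the weight \lambda are assumptions introduced for illustration, and the paper's own notation may differ.

```latex
\begin{aligned}
z_v &= E_v(h_v), & \hat{h}_v &= D_v(z_v), \\
z_l &= E_l(h_l), & \hat{h}_l &= D_l(z_l), \\
\mathcal{L} &= \mathcal{L}_{\mathrm{rec}}(h_v, \hat{h}_v)
             + \mathcal{L}_{\mathrm{rec}}(h_l, \hat{h}_l)
             + \lambda \, \mathcal{L}_{\mathrm{align}}(z_v, z_l)
\end{aligned}
```

Under this reading, "latent" and "modulated" would both name z_v and z_l, and stating that once before the first equation would resolve the ambiguity.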
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify key aspects of our work. We respond to each major comment below and indicate planned revisions.
- Referee: [Introduction and §3] Introduction and §3 (Method): The description of cross-modal objectives (contrastive loss and the proposed Spread Loss) presupposes explicit image-text correspondences to form positive/negative pairs or regression targets. Yet the introduction and abstract frame JAM as operating on independently trained unimodal models without requiring paired multimodal data for the alignment stage. Because reconstruction losses are unimodal while alignment losses are not, the practical claim that JAM can be applied in purely unpaired settings is not supported by the stated objectives; experiments on standard paired corpora (COCO, Flickr30k) further indicate that paired data is consumed.
  Authors: We agree that the cross-modal objectives (contrastive loss and Spread Loss) require paired image-text data to define positives, negatives, or regression targets, while the unimodal reconstruction losses do not. The manuscript's framing that JAM operates 'without requiring paired multimodal data' is imprecise. The vision and language models are independently trained and remain frozen, but the JAM alignment stage consumes paired data from standard corpora. We will revise the abstract and Introduction to explicitly distinguish these points: JAM aligns frozen independently-trained models using paired data for cross-modal objectives, without retraining the foundation models. This addresses the inconsistency.
  revision: yes
- Referee: [§4] §4 (Experiments): The central claim that JAM 'reliably induces alignment' and outperforms contrastive baselines rests on quantitative results that are not previewed with concrete metrics, error bars, dataset sizes, or ablation tables in the abstract or summary. Without these details it is impossible to judge whether the reported alignment is statistically meaningful or merely reflects the capacity of the added autoencoders rather than genuine cross-modal semantic convergence.
  Authors: Section 4 contains the full quantitative results, including specific metrics (e.g., alignment scores, retrieval accuracies), error bars, dataset sizes (COCO ~113k images, Flickr30k), and ablation tables comparing objectives, layers, and scales. To make these claims more immediately evaluable, we will revise the abstract to preview key numerical findings, such as the performance gains of the Spread Loss over contrastive baselines and the scale of the experiments. This will help distinguish genuine cross-modal convergence from autoencoder capacity effects.
  revision: yes
Circularity Check
No significant circularity in JAM derivation chain
The paper presents JAM as an empirical training procedure that applies coordinated reconstruction losses (computable separately per modality) plus cross-modal alignment objectives to modality-specific autoencoders atop frozen unimodal models. Claims rest on systematic experimental evaluations across alignment objectives, layer depth, and model scale rather than any closed-form derivation or first-principles result that reduces to its own inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The approach is self-contained against external benchmarks via reported performance on standard corpora.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Independently trained vision and language models can be aligned via joint autoencoder training with coordinated objectives.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives... multimodal Spread Loss that outperforms classic contrastive methods"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.