pith. sign in

arxiv: 2507.01201 · v7 · pith:DOMGPEW2new · submitted 2025-07-01 · 💻 cs.LG · cs.CV

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Pith reviewed 2026-05-21 23:52 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords vision-language alignmentmultimodal modelsautoencodersrepresentation alignmentjoint trainingPlatonic Representation Hypothesisfine-grained distinctions
0
0 comments X

The pith

Joint autoencoders align independently trained vision and language models by coordinating reconstruction and cross-modal objectives on frozen backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Independently trained vision and language models occupy separate representational spaces, yet the Platonic Representation Hypothesis suggests they may still converge on a shared statistical model of reality. The paper introduces the Joint Autoencoder Modulator (JAM) to explicitly induce alignment by placing modality-specific autoencoders on top of frozen unimodal models and training them with both within-modality reconstruction and cross-modal alignment losses. The approach targets fine-grained contextual distinctions where global meaning is shared but compositional details differ, and it does so without paired multimodal training data. Systematic tests vary the alignment objective (including a new multimodal Spread Loss), the layer at which alignment occurs, and the scale of the underlying foundation models. The central result is that JAM produces reliable alignment even when the original models were trained completely separately.

Core claim

The Joint Autoencoder Modulator (JAM) reliably induces alignment between independently trained vision and language representations by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives, including a multimodal Spread Loss that outperforms classic contrastive methods; this holds across choices of layer depth and foundation-model scale.

What carries the argument

Joint Autoencoder Modulator (JAM): modality-specific autoencoders placed atop frozen unimodal models and trained jointly with reconstruction losses inside each modality plus cross-modal alignment losses.

If this is right

  • JAM enables conversion of generalist unimodal models into specialist multimodal models while preserving original unimodal performance.
  • A multimodal Spread Loss outperforms standard contrastive objectives for aligning fine-grained contextual distinctions.
  • Alignment is most effective at particular layer depths and improves with larger foundation model scale.
  • Shared semantics can be actively optimized rather than merely observed after the fact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same autoencoder modulator pattern could be tested on modality pairs beyond vision and language.
  • If alignment emerges without paired data, the method may reduce reliance on expensive multimodal corpora for downstream tasks.
  • Layer-depth and scale findings suggest concrete starting points for practitioners choosing where to attach such modulators.

Load-bearing premise

Coordinated reconstruction and cross-modal alignment objectives applied to modality-specific autoencoders on top of frozen models will produce useful alignment without requiring paired multimodal training data or degrading the original unimodal capabilities.

What would settle it

A controlled experiment in which JAM is applied to a pair of independently trained models and cross-modal retrieval or generation accuracy on fine-grained distinction tasks shows no improvement over the unaligned frozen baselines.

Figures

Figures reproduced from arXiv: 2507.01201 by Been Kim, Lauren Hyoseo Yoon, Yisong Yue.

Figure 1
Figure 1. Figure 1: Illustration of fine-grained contextual understanding from the SugarCrepe dataset [20]. Each image is paired with three types of captions: (i) Match (true positive) captions that correctly describe the image, (ii) Easy non-match captions that are entirely unrelated, and (iii) Hard non-match (hard negative) captions that share global semantics with the true caption but diverge in subtle, fine-grained detail… view at source ↗
Figure 2
Figure 2. Figure 2: Statistical Metrics for Representation Alignment: Across all metrics and model cases, match pairs consistently show higher alignment scores than easy non-match pairs, supporting the hypothesis that unimodal models encode shared global structure. However, hard non-match pairs exhibit similarly high scores. This indicates that while statistical metrics for representations reveal coarse representational compa… view at source ↗
Figure 3
Figure 3. Figure 3: Joint Autoencoder Modulator (JAM) framework. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the Objective of Spread Loss Formulation: Blue and Pink circle correspond to similar context group; Green circles are representations outside of similar context. Figure inspired by [31]. To address this, we introduce Lspread, a contrastive objective that incorporates a notion of context similarity and fine-grained differ￾entiation. As shown in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: α supervision with respect to the extracted embeddings layers to achieve the best retrieval accuracy for each task. |NL|, |NV | refer to the total layers of each pretrained language, and vision model. nL, nV refer to the layer-depth used for Early, Mid, Late experiments, respectively [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Image-to-Text Retrieval Recall@1 (5 options case) achieved through the α in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions-where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Joint Autoencoder Modulator (JAM) to explicitly optimize alignment between independently trained frozen vision and language models. JAM trains modality-specific autoencoders using coordinated reconstruction losses together with cross-modal alignment objectives (including a new multimodal Spread Loss), and evaluates the method along three axes: choice of alignment objective, layer depth for alignment, and foundation-model scale. The central claim is that this procedure reliably induces useful alignment even across disjoint representational spaces, yielding both theoretical insight into shared semantics and practical guidance for building specialist multimodal models from generalist unimodal foundations.

Significance. If the empirical claims hold, the work would offer a post-hoc route to multimodal capability that avoids full retraining of large foundation models, potentially lowering compute barriers. The Spread Loss and the systematic study of layer depth and scale would constitute concrete technical contributions to the literature on representation alignment and the Platonic Representation Hypothesis.

major comments (2)
  1. [Introduction and §3] Introduction and §3 (Method): The description of cross-modal objectives (contrastive loss and the proposed Spread Loss) presupposes explicit image-text correspondences to form positive/negative pairs or regression targets. Yet the introduction and abstract frame JAM as operating on independently trained unimodal models without requiring paired multimodal data for the alignment stage. Because reconstruction losses are unimodal while alignment losses are not, the practical claim that JAM can be applied in purely unpaired settings is not supported by the stated objectives; experiments on standard paired corpora (COCO, Flickr30k) further indicate that paired data is consumed.
  2. [§4] §4 (Experiments): The central claim that JAM 'reliably induces alignment' and outperforms contrastive baselines rests on quantitative results that are not previewed with concrete metrics, error bars, dataset sizes, or ablation tables in the abstract or summary. Without these details it is impossible to judge whether the reported alignment is statistically meaningful or merely reflects the capacity of the added autoencoders rather than genuine cross-modal semantic convergence.
minor comments (2)
  1. [Abstract] Abstract: The three design axes are listed but the key quantitative outcomes (e.g., retrieval accuracy deltas, Spread Loss vs. contrastive margins) are not summarized, reducing the abstract's utility as a standalone overview.
  2. [§3] Notation: Define the embedding spaces of the vision and language autoencoders with consistent symbols (e.g., z_v, z_l) before the first equation in §3; current usage appears to switch between 'latent' and 'modulated' without explicit mapping.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our work. We respond to each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Introduction and §3] Introduction and §3 (Method): The description of cross-modal objectives (contrastive loss and the proposed Spread Loss) presupposes explicit image-text correspondences to form positive/negative pairs or regression targets. Yet the introduction and abstract frame JAM as operating on independently trained unimodal models without requiring paired multimodal data for the alignment stage. Because reconstruction losses are unimodal while alignment losses are not, the practical claim that JAM can be applied in purely unpaired settings is not supported by the stated objectives; experiments on standard paired corpora (COCO, Flickr30k) further indicate that paired data is consumed.

    Authors: We agree that the cross-modal objectives (contrastive loss and Spread Loss) require paired image-text data to define positives, negatives, or regression targets, while the unimodal reconstruction losses do not. The manuscript's framing that JAM operates 'without requiring paired multimodal data' is imprecise. The vision and language models are independently trained and remain frozen, but the JAM alignment stage consumes paired data from standard corpora. We will revise the abstract and Introduction to explicitly distinguish these points: JAM aligns frozen independently-trained models using paired data for cross-modal objectives, without retraining the foundation models. This addresses the inconsistency. revision: yes

  2. Referee: [§4] §4 (Experiments): The central claim that JAM 'reliably induces alignment' and outperforms contrastive baselines rests on quantitative results that are not previewed with concrete metrics, error bars, dataset sizes, or ablation tables in the abstract or summary. Without these details it is impossible to judge whether the reported alignment is statistically meaningful or merely reflects the capacity of the added autoencoders rather than genuine cross-modal semantic convergence.

    Authors: Section 4 contains the full quantitative results, including specific metrics (e.g., alignment scores, retrieval accuracies), error bars, dataset sizes (COCO ~113k images, Flickr30k), and ablation tables comparing objectives, layers, and scales. To make these claims more immediately evaluable, we will revise the abstract to preview key numerical findings, such as the performance gains of the Spread Loss over contrastive baselines and the scale of the experiments. This will help distinguish genuine cross-modal convergence from autoencoder capacity effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity in JAM derivation chain

full rationale

The paper presents JAM as an empirical training procedure that applies coordinated reconstruction losses (computable separately per modality) plus cross-modal alignment objectives to modality-specific autoencoders atop frozen unimodal models. Claims rest on systematic experimental evaluations across alignment objectives, layer depth, and model scale rather than any closed-form derivation or first-principles result that reduces to its own inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The approach is self-contained against external benchmarks via reported performance on standard corpora.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available, so ledger is incomplete. The Platonic Representation Hypothesis is invoked as background motivation but treated as a hypothesis rather than a proven axiom.

axioms (1)
  • domain assumption Independently trained vision and language models can be aligned via joint autoencoder training with coordinated objectives
    Central premise of the JAM construction stated in the abstract.

pith-pipeline@v0.9.0 · 5735 in / 1212 out tokens · 56635 ms · 2026-05-21T23:52:28.068131+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 3 internal anchors

  1. [1]

    The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. Proceedings of the 41st International Conference of Machine Learning, 2024

  2. [2]

    Republic (De Republica)

    Plato. Republic (De Republica). 375 BC

  3. [3]

    Revisiting model stitching to compare neural representations

    Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. Advances in neural information processing systems, 2021

  4. [4]

    Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning

    Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Pro- ceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, page 2443–2449, New York, NY , USA, 2021. Association for ...

  5. [5]

    Kornblith, M

    S. Kornblith, M. Norouzi, H. Lee, and G Hinton. Similarity of neural network representations revisited. Proceedings of the 36th International Conference on Machine Learning , page 3519–3529, 2019

  6. [6]

    Raghu, J

    M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 2017

  7. [7]

    Insights on representational similarity in neural networks with canonical correlation

    Ari S Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems, volume 31, 2018

  8. [8]

    Similarity of neural network models: A survey of functional and representational measures

    Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. Similarity of neural network models: A survey of functional and representational measures. ACM Comput. Surv., 57(9), May 2025

  9. [9]

    Visiolinguistic attention learning for multimodal coreference resolution

    Mahmoud Azab, Xuwang Lyu, Lane Schwartz, and Jeffrey Allen. Visiolinguistic attention learning for multimodal coreference resolution. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1990–2000, 2019

  10. [10]

    Language Is Not All You Need: Aligning Perception with Language Models

    Xiaodong Liu, Yujie Wang, Yichong Xu, Yuwei Chen, et al. Hidden talents of multi- modal models: Can pretrained multimodal models help monomodal tasks? arXiv preprint arXiv:2302.14045, 2023

  11. [11]

    Understanding image representations by measuring their equivariance and equivalence, 2015

    Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence, 2015. 20

  12. [12]

    Gemini: a family of highly capable multimodal models, 2023

    Google. Gemini: a family of highly capable multimodal models, 2023

  13. [13]

    Gpt-4 with vision

    OpenAI. Gpt-4 with vision. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023

  14. [14]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024

  15. [15]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Interna- tional Conference on Machine Learning, 2021

  16. [16]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

  17. [17]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022

  18. [18]

    BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

  19. [19]

    Deepseek-vl: Scaling vision-language with decoupled multimodal pretraining

    Can Xu, Qiaolin Zeng, Yichong Wu, Yifan Zhang, Qian Li, Wei Wei, et al. Deepseek-vl: Scaling vision-language with decoupled multimodal pretraining. arXiv preprint arXiv:2403.09696, 2024

  20. [20]

    Sugar- crepe: Fixing hackable benchmarks for vision-language compositionality

    Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugar- crepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

  21. [21]

    Gemma Team. Gemma. 2024

  22. [22]

    2 olmo 2 furious, 2024

    Team OLMo. 2 olmo 2 furious, 2024

  23. [23]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba...

  24. [24]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  25. [25]

    What regularized auto-encoders learn from the data- generating distribution

    Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data- generating distribution. J. Mach. Learn. Res., 15(1):3563–3593, January 2014

  26. [26]

    Regularized linear autoen- coders recover the principal components, eventually

    Xuchan Bao, James Lucas, Sushant Sachdeva, and Roger Grosse. Regularized linear autoen- coders recover the principal components, eventually. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, 2020

  27. [27]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016

  28. [28]

    Glu variants improve transformer, 2020

    Noam Shazeer. Glu variants improve transformer, 2020

  29. [29]

    Gomez, Lukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023

  30. [30]

    When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations, 2023

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations, 2023. 21

  31. [31]

    Chen, Daniel Y

    Mayee F. Chen, Daniel Y . Fu, Avanika Narayan, Michael Zhang, Zhao Song, Kayvon Fatahalian, and Christopher Ré. Perfectly balanced: Improving transfer and robustness of supervised contrastive learning. 2022

  32. [32]

    Fu, Mayee F

    Daniel Y . Fu, Mayee F. Chen, Michael Zhang, Kayvon Fatahalian, and Christopher Ré. The details matter: Preventing class collapse in supervised contrastive learning. 2022

  33. [33]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020

  34. [34]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022

  35. [35]

    Masked Autoencoders Are Scalable Vision Learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021

  36. [36]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. 2020

  37. [37]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  38. [38]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  40. [40]

    Openclip, July 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021

  41. [41]

    LAION-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmar- czyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text m...

  42. [42]

    Sigmoid Loss for Language Image Pre-Training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023

  43. [43]

    Understanding dimensional collapse in contrastive self-supervised learning, 2022

    Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning, 2022

  44. [44]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, page 41–48, 2009

  45. [45]

    Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav), 2018

    Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav), 2018

  46. [46]

    Network dissection: Quantifying interpretability of deep visual representations

    David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017. 22

  47. [47]

    Foundation models for time series analysis: A tutorial and survey

    Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6555–6565, 2024

  48. [48]

    Totem: Tokenized time series embeddings for general time series analysis

    Sabera J Talukder, Yisong Yue, and Georgia Gkioxari. Totem: Tokenized time series embeddings for general time series analysis. Transactions on Machine Learning Research

  49. [49]

    Moment: A family of open time-series foundation models

    Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024

  50. [50]

    A decoder-only foundation model for time-series forecasting

    Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024

  51. [51]

    Relations between two sets of variates

    Harold Hotelling. Relations between two sets of variates. 28(3-4):321–377, 1936

  52. [52]

    Reproducing kernel hilbert space, mercer’s theorem, eigenfunctions, nyström method, and use of kernels in machine learning: Tutorial and survey, 2021

    Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, and Mark Crowley. Reproducing kernel hilbert space, mercer’s theorem, eigenfunctions, nyström method, and use of kernels in machine learning: Tutorial and survey, 2021

  53. [53]

    Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexan- der J. Smola. A kernel statistical test of independence. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’07, page 585–592, 2007

  54. [54]

    High-dimensional canonical correlation analysis, 2025

    Anna Bykhovskaya and Vadim Gorin. High-dimensional canonical correlation analysis, 2025. 23