pith. machine review for the scientific record.

arxiv: 2605.01325 · v1 · submitted 2026-05-02 · 💻 cs.CV · cs.LG

Recognition: unknown

Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:25 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision-language models · model selection · Gromov-Wasserstein distance · vision encoders · cross-modal alignment · multimodal learning · embedding similarity

The pith

The Gromov-Wasserstein distance between vision and language embeddings predicts optimal vision encoders for VLMs better than size or accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Common ways to pick vision encoders for vision-language models, such as choosing the largest one or the one with the highest image-classification accuracy, do not reliably lead to the strongest final models. The paper finds that the structural similarity between the vision encoder and the language model, measured by how well their embedding spaces can be matched under the Gromov-Wasserstein distance, is a much better guide. This distance can be computed from the pre-trained models alone, without any joint training. Theory shows it relates to how easily a mapping between vision and language can be learned, and tests across dozens of encoders and full training runs confirm it correlates strongly with actual VLM results.

Core claim

The learnability of cross-modality mapping in VLMs can be provably associated with the Gromov-Wasserstein distance between pre-trained vision and language embeddings, and this distance correlates more strongly with final VLM performance than traditional metrics such as model size or zero-shot accuracy.

What carries the argument

Gromov-Wasserstein distance computed between the feature spaces of pre-trained vision encoders and language models, serving as a proxy for structural similarity that aids cross-modal mapping.
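As a concrete illustration, here is a minimal sketch of how such an inference-only metric could be computed with the POT (Python Optimal Transport) library; the probe-set size, distance metric, and max-normalization are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch (not the paper's exact recipe) of the inference-only
# metric, using the POT library (pip install pot). The probe-set size,
# distance metric, and max-normalization are illustrative assumptions.
import numpy as np
import ot  # Python Optimal Transport

def gw_distance(vision_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Gromov-Wasserstein discrepancy between two embedding clouds.

    Only intra-space distance matrices are compared, so the two
    embedding spaces may have different dimensions.
    """
    C1 = ot.dist(vision_emb, vision_emb, metric="euclidean")
    C2 = ot.dist(text_emb, text_emb, metric="euclidean")
    C1 /= C1.max()  # put both spaces on a comparable scale
    C2 /= C2.max()
    p = ot.unif(C1.shape[0])  # uniform mass over probe samples
    q = ot.unif(C2.shape[0])
    return ot.gromov.gromov_wasserstein2(C1, C2, p, q, loss_fun="square_loss")

# Toy usage: 256 probe samples, 768-d vision features vs 4096-d LLM features.
rng = np.random.default_rng(0)
print(gw_distance(rng.normal(size=(256, 768)), rng.normal(size=(256, 4096))))
```

Because GW compares only within-space distance matrices, the vision and language embeddings never need to live in the same dimension, which is what makes the metric computable before any projector is trained.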

If this is right

  • Vision encoders should be chosen to minimize Gromov-Wasserstein distance to the target language model rather than by scale or standalone accuracy (see the sketch after this list).
  • Model selection for VLMs can be performed inference-only, before any joint training occurs.
  • Structural alignment across modalities is a critical, previously overlooked factor in building effective multimodal systems.
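A sketch of what that selection rule could look like in practice, reusing the hypothetical gw_distance helper from the earlier sketch; the candidate names and embeddings are placeholders.

```python
# Hypothetical selection loop over candidate encoders, reusing the
# gw_distance helper sketched earlier; inputs are placeholders.
def select_encoder(candidates: dict[str, np.ndarray], llm_emb: np.ndarray) -> str:
    """Return the candidate whose embedding cloud is closest to the LLM's."""
    scores = {name: gw_distance(emb, llm_emb) for name, emb in candidates.items()}
    for name, s in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{name}: GW = {s:.4f}")  # ranked report, smallest first
    return min(scores, key=scores.get)
```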

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This selection criterion could extend to choosing encoders for other multimodal pairings such as audio-language models.
  • Vision encoder pre-training objectives might be redesigned to directly minimize this distance to common language models.

Load-bearing premise

That the structural similarity captured by Gromov-Wasserstein distance on pre-trained embeddings remains the dominant factor once the full VLM training objective and data mixture are introduced.

What would settle it

Training a VLM with an encoder that has large Gromov-Wasserstein distance yet achieves top performance, or one with small distance that underperforms after full training, would challenge the central claim.

Figures

Figures reproduced from arXiv: 2605.01325 by Bo Han, Elliot Osborne, Jianbo Ma, Muyang Li, Tongliang Liu, Yucheng Liu.

Figure 1. Correlation analysis of zero-shot classification accuracy, vision encoder size, and GW distance (from left to right).
Figure 2. A toy example showing the intuition of GW distance.
Figure 3. Scaling trend of runtime.
Figure 4. Correlation of the vision encoder ranking across different LLMs.
original abstract

Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, there still lacks a principled understanding of what makes a vision encoder suitable for VLM alignment. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding begs a fundamental question: What factors of vision-encoders matter in VLM? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript presents a study on selecting vision encoders for Vision-Language Models (VLMs) by introducing the Gromov-Wasserstein (GW) distance as a measure of structural similarity between vision and language embeddings. The authors argue that common heuristics like encoder size or zero-shot performance are poor predictors, while GW distance provides a stronger correlation with downstream VLM performance. They support this with analysis of 19 vision encoders, a theoretical argument linking GW to cross-modal mapping learnability, and empirical results from more than 60 complete VLM training runs demonstrating superior predictive power of the proposed metric.

Significance. If the central claims hold, this paper makes a significant contribution by offering a principled, training-free method for vision encoder selection in VLMs, which could substantially reduce computational costs in multimodal model development. The empirical validation across a large number of full training runs (60+) is a notable strength, providing concrete evidence beyond small-scale ablations. Additionally, the attempt to ground the metric in a theoretical association with learnability adds depth, though its rigor needs confirmation. This approach could shift practices in the field toward more informed model selection strategies.

major comments (3)
  1. Theoretical Analysis section: The abstract claims that learnability of cross-modality mapping 'can be provably associated' with the Gromov-Wasserstein distance, yet the provided text lacks the full derivation or key proof steps. This makes it impossible to verify whether the association is rigorous or relies on unstated assumptions about the mapping objective.
  2. Experimental Results (60+ VLM runs): The central empirical claim rests on pre-computed GW distance predicting final performance, but the manuscript does not specify whether vision encoders remain frozen or are updated during VLM training. If encoders are fine-tuned (common in joint cross-modal objectives), the initial structural similarity may become transient, directly challenging the claim that pre-training GW remains the dominant factor (see skeptic note on joint optimization).
  3. Comparison to baselines: The paper states that GW outperforms alternatives like size or zero-shot accuracy with 'much stronger correlation,' but without reporting exact coefficients (e.g., Pearson r or R² values) in a dedicated table or the precise implementation of the 60+ runs (data mixture, VLM architecture, training hyperparameters), the superiority cannot be fully assessed.
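For concreteness, the coefficients the referee asks to see tabulated could be computed as below; the arrays are toy placeholders, not the paper's measured values.

```python
# Toy illustration of the requested correlation report; the arrays are
# placeholders, NOT the paper's measured values.
import numpy as np
from scipy import stats

gw   = np.array([0.12, 0.08, 0.21, 0.15, 0.05])   # GW distance per encoder
perf = np.array([61.0, 64.5, 55.2, 59.8, 66.1])   # final VLM score per encoder

r, _ = stats.pearsonr(gw, perf)      # linear correlation (expected negative)
rho, _ = stats.spearmanr(gw, perf)   # rank correlation over encoder ranking
print(f"Pearson r = {r:.3f}, R² = {r**2:.3f}, Spearman ρ = {rho:.3f}")
```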

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.

point-by-point responses
  1. Referee: Theoretical Analysis section: The abstract claims that learnability of cross-modality mapping 'can be provably associated' with the Gromov-Wasserstein distance, yet the provided text lacks the full derivation or key proof steps. This makes it impossible to verify whether the association is rigorous or relies on unstated assumptions about the mapping objective.

    Authors: We appreciate the referee pointing out the need for greater rigor in the theoretical section. The manuscript outlines the association by showing that the Gromov-Wasserstein distance bounds the optimal transport cost between vision and language embedding spaces, which directly relates to the sample complexity required for learning a cross-modal mapping under standard assumptions on the alignment objective. However, we agree that the current presentation would benefit from explicit key proof steps and a clearer statement of assumptions. In the revised manuscript, we will expand the Theoretical Analysis section with the full derivation and include a complete proof in the appendix. revision: yes

  2. Referee: Experimental Results (60+ VLM runs): The central empirical claim rests on pre-computed GW distance predicting final performance, but the manuscript does not specify whether vision encoders remain frozen or are updated during VLM training. If encoders are fine-tuned (common in joint cross-modal objectives), the initial structural similarity may become transient, directly challenging the claim that pre-training GW remains the dominant factor (see skeptic note on joint optimization).

    Authors: This is a valid concern regarding experimental clarity. In all 60+ VLM training runs described in the paper, the vision encoders are kept entirely frozen, and only the cross-modal alignment module (projector) is trained (see the sketch after these responses). This setup is consistent with standard VLM training protocols that aim to leverage pre-trained visual representations without altering them. As a result, the pre-computed GW distance remains a stable predictor. We will explicitly document this design choice, including the training protocol details, in the revised Experimental Results section to eliminate any ambiguity. revision: yes

  3. Referee: Comparison to baselines: The paper states that GW outperforms alternatives like size or zero-shot accuracy with 'much stronger correlation,' but without reporting exact coefficients (e.g., Pearson r or R² values) in a dedicated table or the precise implementation of the 60+ runs (data mixture, VLM architecture, training hyperparameters), the superiority cannot be fully assessed.

    Authors: We agree that quantitative precision and reproducibility details are essential for evaluating the claimed superiority. In the revision, we will add a dedicated table that reports the exact Pearson correlation coefficients (r) and R² values comparing GW distance to VLM performance, alongside the same metrics for baselines such as encoder size and zero-shot accuracy. We will also expand the experimental setup subsection to fully specify the VLM architecture, data mixture, training hyperparameters, and other implementation details for the 60+ runs. revision: yes
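A minimal sketch of the frozen-encoder, projector-only setup described in response 2; the dimensions, the two-layer MLP shape, and the learning rate are assumptions in the style of LLaVA-like pipelines, not the paper's verified configuration.

```python
# Minimal sketch of the setup in response 2: the vision encoder is frozen
# and only the cross-modal projector receives gradients. Dimensions, the
# two-layer MLP, and the learning rate are assumptions, not verified specs.
import torch
import torch.nn as nn

vision_encoder = nn.Linear(768, 1024)   # stand-in for a pre-trained encoder
for p in vision_encoder.parameters():
    p.requires_grad_(False)             # frozen throughout training

projector = nn.Sequential(              # the only trainable module
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096)
)
optimizer = torch.optim.AdamW(projector.parameters(), lr=2e-3)
```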

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper computes the Gromov-Wasserstein distance directly as an inference-only metric on frozen pre-trained vision encoder embeddings, without any fitting or optimization against VLM performance targets. The theoretical claim of a provable association between GW distance and cross-modality learnability is presented as an independent derivation rather than a post-hoc fit. Empirical validation relies on 60+ separate full VLM training runs to measure correlation, providing external evidence. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the provided derivation steps. The central result remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that Gromov-Wasserstein distance on frozen embeddings faithfully reflects the learnability of the cross-modal mapping under standard VLM objectives; no new entities are introduced and no parameters are fitted to the target VLM performance.

axioms (1)
  • standard math Gromov-Wasserstein distance quantifies structural dissimilarity between two metric spaces in a way that is invariant to isometries.
    Invoked when the paper treats GW distance as a proxy for modality alignment.
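For reference, a standard form of the order-2 Gromov-Wasserstein distance between metric measure spaces, in the sense of Mémoli (conventions on constant factors vary); this is textbook background, not a formula quoted from the paper.

```latex
% Order-2 Gromov-Wasserstein distance between metric measure spaces
% (X, d_X, \mu) and (Y, d_Y, \nu); \Pi(\mu, \nu) denotes the set of
% couplings with marginals \mu and \nu. Isometries of either space
% leave the value unchanged, which is the invariance invoked above.
\mathrm{GW}_2(X, Y) = \min_{\pi \in \Pi(\mu, \nu)}
  \left( \iint_{(X \times Y)^2}
    \left| d_X(x, x') - d_Y(y, y') \right|^2
    \, \mathrm{d}\pi(x, y) \, \mathrm{d}\pi(x', y')
  \right)^{1/2}
```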

pith-pipeline@v0.9.0 · 5554 in / 1273 out tokens · 25173 ms · 2026-05-09T14:25:21.531418+00:00 · methodology

discussion (0)

