Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Pith reviewed 2026-05-09 14:25 UTC · model grok-4.3
The pith
The Gromov-Wasserstein distance between vision and language embeddings predicts which vision encoders will pair best with a given language model better than encoder size or zero-shot accuracy does.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The learnability of cross-modality mapping in VLMs can be provably associated with the Gromov-Wasserstein distance between pre-trained vision and language embeddings, and this distance correlates more strongly with final VLM performance than traditional metrics such as model size or zero-shot accuracy.
What carries the argument
Gromov-Wasserstein distance computed between the feature spaces of pre-trained vision encoders and language models, serving as a proxy for structural similarity that aids cross-modal mapping.
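In concrete terms, the GW objective measures how much a coupling between the two embedding sets distorts intra-modal pairwise distances. A minimal numpy sketch (the coupling below is the paired identity matrix for illustration only; the actual metric optimizes over couplings, e.g., with a solver such as POT's `ot.gromov` module):

```python
import numpy as np

def pairwise_dists(X):
    """Intra-modal Euclidean distance matrix for one embedding space."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def gw_objective(C1, C2, T):
    """GW distortion sum_{i,j,k,l} (C1[i,k] - C2[j,l])^2 T[i,j] T[k,l],
    computed via tensor contractions instead of the quartic loop."""
    p, q = T.sum(axis=1), T.sum(axis=0)      # marginals of the coupling
    term1 = p @ (C1 ** 2) @ p                # C1^2 weighted by its marginal
    term2 = q @ (C2 ** 2) @ q                # C2^2 weighted by its marginal
    cross = np.sum((T.T @ C1 @ T) * C2)      # cross term through the coupling
    return term1 + term2 - 2.0 * cross

# toy paired embeddings: a rotated copy has identical intra-modal geometry
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 8))                 # "vision" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
L = V @ Q                                    # "language" embeddings (isometric)
T = np.eye(50) / 50                          # paired identity coupling
print(gw_objective(pairwise_dists(V), pairwise_dists(L), T))  # ≈ 0 for isometric spaces
```

Two spaces with the same intra-modal geometry score (near) zero regardless of how their ambient coordinates differ, which is exactly the structural-similarity notion the paper leans on.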
If this is right
- Vision encoders should be chosen to minimize Gromov-Wasserstein distance to the target language model rather than by scale or standalone accuracy.
- Model selection for VLMs can be performed with inference alone, before any joint training occurs.
- Structural alignment across modalities is a critical, previously overlooked factor in building effective multimodal systems.
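Taken together, these bullets imply a mechanical selection recipe: embed a shared probe set with each candidate encoder and with the target language model, score structural distance, and take the minimizer. A hedged sketch, substituting Mémoli's cheap distance-distribution lower bound for a full GW solve (`select_encoder` and the toy encoders are hypothetical, not the paper's implementation):

```python
import numpy as np

def dist_profile(X):
    """Sorted vector of all pairwise distances: a cheap, isometry-invariant
    summary of an embedding space's intra-modal geometry."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(D[np.triu_indices(len(X), k=1)])

def gw_proxy(X, Y):
    """First lower bound on the GW distance: gap between distance profiles.
    A stand-in for a full GW solver; assumes equal sample counts."""
    return float(np.linalg.norm(dist_profile(X) - dist_profile(Y))) / len(X)

def select_encoder(encoders, text_emb, probe_images):
    """Rank candidate encoders by structural distance to the LLM's space."""
    scores = {name: gw_proxy(enc(probe_images), text_emb)
              for name, enc in encoders.items()}
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(1)
probe = rng.normal(size=(40, 6))                      # toy probe inputs
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
text_emb = probe @ Q                                  # toy LLM space, isometric to probe
encoders = {
    "aligned": lambda x: x,                           # preserves probe geometry
    "scrambled": lambda x: rng.normal(size=x.shape),  # unrelated geometry
}
best, scores = select_encoder(encoders, text_emb, probe)  # best == "aligned"
```

The whole procedure needs only forward passes through the candidate encoders, which is what makes it inference-only.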
Where Pith is reading between the lines
- This selection criterion could extend to choosing encoders for other multimodal pairings such as audio-language models.
- Vision encoder pre-training objectives might be redesigned to directly minimize this distance to common language models.
Load-bearing premise
That the structural similarity captured by Gromov-Wasserstein distance on pre-trained embeddings remains the dominant factor once the full VLM training objective and data mixture are introduced.
What would settle it
Training a VLM with an encoder that has large Gromov-Wasserstein distance yet achieves top performance, or one with small distance that underperforms after full training, would challenge the central claim.
read the original abstract
Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, there still lacks a principled understanding of what makes a vision encoder suitable for VLM alignment. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding begs a fundamental question: What factors of vision-encoders matter in VLM? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a study on selecting vision encoders for Vision-Language Models (VLMs) by introducing the Gromov-Wasserstein (GW) distance as a measure of structural similarity between vision and language embeddings. The authors argue that common heuristics like encoder size or zero-shot performance are poor predictors, while GW distance provides a stronger correlation with downstream VLM performance. They support this with analysis of 19 vision encoders, a theoretical argument linking GW to cross-modal mapping learnability, and empirical results from more than 60 complete VLM training runs demonstrating superior predictive power of the proposed metric.
Significance. If the central claims hold, this paper makes a significant contribution by offering a principled, training-free method for vision encoder selection in VLMs, which could substantially reduce computational costs in multimodal model development. The empirical validation across a large number of full training runs (60+) is a notable strength, providing concrete evidence beyond small-scale ablations. Additionally, the attempt to ground the metric in a theoretical association with learnability adds depth, though its rigor needs confirmation. This approach could shift practices in the field toward more informed model selection strategies.
major comments (3)
- Theoretical Analysis section: The abstract claims that learnability of cross-modality mapping 'can be provably associated' with the Gromov-Wasserstein distance, yet the provided text lacks the full derivation or key proof steps. This makes it impossible to verify whether the association is rigorous or relies on unstated assumptions about the mapping objective.
- Experimental Results (60+ VLM runs): The central empirical claim rests on pre-computed GW distance predicting final performance, but the manuscript does not specify whether vision encoders remain frozen or are updated during VLM training. If encoders are fine-tuned (common in joint cross-modal objectives), the initial structural similarity may become transient, directly challenging the claim that pre-training GW remains the dominant factor (see skeptic note on joint optimization).
- Comparison to baselines: The paper states that GW outperforms alternatives like size or zero-shot accuracy with 'much stronger correlation,' but without reporting exact coefficients (e.g., Pearson r or R² values) in a dedicated table, or the precise implementation of the 60+ runs (data mixture, VLM architecture, training hyperparameters), the superiority cannot be fully assessed.
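The coefficients requested in the last comment are cheap to compute once per-encoder metrics and final scores are tabulated. A sketch with hypothetical numbers (not the paper's actual results):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between a selection metric and VLM performance."""
    return float(np.corrcoef(x, y)[0, 1])

# hypothetical values for five encoders, for illustration only
gw_dist   = np.array([0.12, 0.18, 0.25, 0.31, 0.40])   # proposed metric
params_b  = np.array([0.3, 1.0, 0.6, 2.5, 1.8])        # encoder size (B params)
zero_shot = np.array([68.0, 75.2, 71.5, 79.0, 76.4])   # zero-shot accuracy (%)
vlm_score = np.array([71.0, 68.5, 66.0, 63.2, 60.1])   # downstream VLM benchmark

for name, metric in [("GW distance", gw_dist),
                     ("size", params_b),
                     ("zero-shot acc", zero_shot)]:
    r = pearson_r(metric, vlm_score)
    print(f"{name:14s} r = {r:+.3f}  R^2 = {r * r:.3f}")
```

A table of exactly these r and R² values per baseline, over the real 60+ runs, is what the comment asks the authors to report.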
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.
read point-by-point responses
- Referee: Theoretical Analysis section: The abstract claims that learnability of cross-modality mapping 'can be provably associated' with the Gromov-Wasserstein distance, yet the provided text lacks the full derivation or key proof steps. This makes it impossible to verify whether the association is rigorous or relies on unstated assumptions about the mapping objective.
Authors: We appreciate the referee pointing out the need for greater rigor in the theoretical section. The manuscript outlines the association by showing that the Gromov-Wasserstein distance bounds the optimal transport cost between vision and language embedding spaces, which directly relates to the sample complexity required for learning a cross-modal mapping under standard assumptions on the alignment objective. However, we agree that the current presentation would benefit from explicit key proof steps and a clearer statement of assumptions. In the revised manuscript, we will expand the Theoretical Analysis section with the full derivation and include a complete proof in the appendix. revision: yes
- Referee: Experimental Results (60+ VLM runs): The central empirical claim rests on pre-computed GW distance predicting final performance, but the manuscript does not specify whether vision encoders remain frozen or are updated during VLM training. If encoders are fine-tuned (common in joint cross-modal objectives), the initial structural similarity may become transient, directly challenging the claim that pre-training GW remains the dominant factor (see skeptic note on joint optimization).
Authors: This is a valid concern regarding experimental clarity. In all 60+ VLM training runs described in the paper, the vision encoders are kept entirely frozen, and only the cross-modal alignment module (projector) is trained. This setup is consistent with standard VLM training protocols that aim to leverage pre-trained visual representations without altering them. As a result, the pre-computed GW distance remains a stable predictor. We will explicitly document this design choice, including the training protocol details, in the revised Experimental Results section to eliminate any ambiguity. revision: yes
- Referee: Comparison to baselines: The paper states that GW outperforms alternatives like size or zero-shot accuracy with 'much stronger correlation,' but without reporting exact coefficients (e.g., Pearson r or R² values) in a dedicated table or the precise implementation of the 60+ runs (data mixture, VLM architecture, training hyperparameters), the superiority cannot be fully assessed.
Authors: We agree that quantitative precision and reproducibility details are essential for evaluating the claimed superiority. In the revision, we will add a dedicated table that reports the exact Pearson correlation coefficients (r) and R² values comparing GW distance to VLM performance, alongside the same metrics for baselines such as encoder size and zero-shot accuracy. We will also expand the experimental setup subsection to fully specify the VLM architecture, data mixture, training hyperparameters, and other implementation details for the 60+ runs. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper computes the Gromov-Wasserstein distance directly as an inference-only metric on frozen pre-trained vision encoder embeddings, without any fitting or optimization against VLM performance targets. The theoretical claim of a provable association between GW distance and cross-modality learnability is presented as an independent derivation rather than a post-hoc fit. Empirical validation relies on 60+ separate full VLM training runs to measure correlation, providing external evidence. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the provided derivation steps. The central result remains self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- (standard math) The Gromov-Wasserstein distance quantifies structural dissimilarity between two metric spaces in a way that is invariant to isometries.
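This axiom is easy to verify numerically: an isometry (rotation, reflection, translation) leaves every pairwise distance unchanged, so the intra-modal distance matrices, and hence the GW distance computed from them, are unchanged. A quick check:

```python
import numpy as np

def dists(X):
    """Pairwise Euclidean distance matrix."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal map
Y = X @ Q + rng.normal(size=(1, 5))           # isometry: rotate + translate

# identical distance matrices => GW distance between X and Y is zero
print(np.allclose(dists(X), dists(Y)))        # True
```

Scaling, by contrast, is not an isometry: doubling X doubles every entry of the distance matrix, so GW distance is sensitive to it unless distances are normalized first.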