Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.
Reproducible scal- ing laws for contrastive language-image learning
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 3roles
background 1polarities
background 1representative citing papers
BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumonia detection.
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
citing papers explorer
-
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.
-
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumonia detection.
-
Demystifying CLIP Data
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.