Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.
Cambrian-1: A fully open, vision-centric ex- ploration of multimodal llms.NIPS, 37:87310–87356
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.