Crepe: Can vision-language foundation models reason compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10910–10921, 2023.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
Empirical study shows bidirectional but sensitive relationship between compositionality and long-caption understanding in VLMs, promoted by high-quality grounded data and affected by architectural choices like frozen positional embeddings.