Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining
CPI ranks image-text pairs using phrase-level sensitivity scores from nonce substitutions to improve compositional performance in VL pretraining, achieving gains on relation benchmarks with a 50% data subset.