arXiv preprint arXiv:2309.01859 (2023)
2 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (1)
Citation polarities: background (1)
Representative citing papers
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
Citing papers
- What Matters for Grocery Product Retrieval with Open Source Vision Language Models
  Systematic zero-shot benchmarking of open-source VLMs on multimodal grocery product retrieval shows data quality outperforms scale, introduces semantic power density as an efficiency metric, and identifies a persistent top-1 precision gap.
- Multilingual Vision-Language Models, A Survey
  The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.