SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilingual understanding at scales from 86M to 1B parameters.
2 Pith papers cite this work.
fields: cs.CV (2)
verdicts: UNVERDICTED (2)
roles: background (1)
polarities: background (1)
representative citing papers
PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at various sizes and resolutions.
citing papers explorer
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
- PaliGemma 2: A Family of Versatile VLMs for Transfer