Improved baselines with visual instruction tuning
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 4
verdicts
UNVERDICTED 4
representative citing papers
A vision-language model is finetuned on 114k anonymized relational captions to embed images by their underlying structural correspondences instead of visible attributes.
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.
citing papers explorer
- Indexing Multimodal Language Models for Large-scale Image Retrieval
  Multimodal LLMs act as training-free similarity estimators for instance-level image retrieval by converting next-token probabilities from image-pair prompts into scores, combined with efficient indexing for scalability.
- Relational Visual Similarity
  A vision-language model is finetuned on 114k anonymized relational captions to embed images by their underlying structural correspondences instead of visible attributes.
- See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
  AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
- Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
  CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.