Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Indexing Multimodal Language Models for Large-scale Image Retrieval

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Multimodal LLMs act as training-free similarity estimators for instance-level image retrieval by converting next-token probabilities from image-pair prompts into scores, combined with efficient indexing for scalability.

Relational Visual Similarity

cs.CV · 2025-12-08 · unverdicted · novelty 7.0

A vision-language model is finetuned on 114k anonymized relational captions to embed images by their underlying structural correspondences instead of visible attributes.

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

cs.CV · 2025-12-01 · unverdicted · novelty 7.0

AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.

Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.

citing papers explorer

Indexing Multimodal Language Models for Large-scale Image Retrieval
cs.CV · 2026-04-14 · unverdicted · none · ref 33
Relational Visual Similarity
cs.CV · 2025-12-08 · unverdicted · none · ref 46
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
cs.CV · 2025-12-01 · unverdicted · none · ref 36
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
cs.CV · 2026-04-14 · unverdicted · none · ref 41
