Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

representative citing papers

A Sanity Check on Composed Image Retrieval

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.

Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

cs.CV · 2025-05-26 · unverdicted · novelty 6.0

VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

citing papers explorer

Showing 4 of 4 citing papers.

A Sanity Check on Composed Image Retrieval cs.CV · 2026-04-14 · unverdicted · none · ref 21
The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos cs.CV · 2026-04-20 · unverdicted · none · ref 22
EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction cs.CV · 2025-05-26 · unverdicted · none · ref 42
VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs cs.CV · 2026-05-04 · unverdicted · none · ref 34
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

fields

years

verdicts

representative citing papers

citing papers explorer