Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

representative citing papers

A Sanity Check on Composed Image Retrieval

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.

A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

DBAC is a new directional metric for bias amplification in image captions that is less sensitive to sentence encoders and more accurate than LIC, validated on COCO gender and race attributes.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

cs.CV · 2026-04-22 · conditional · novelty 6.0

R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

citing papers explorer

Showing 4 of 4 citing papers.

A Sanity Check on Composed Image Retrieval cs.CV · 2026-04-14 · unverdicted · none · ref 20
The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.
A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions cs.CV · 2025-03-10 · unverdicted · none · ref 14
DBAC is a new directional metric for bias amplification in image captions that is less sensitive to sentence encoders and more accurate than LIC, validated on COCO gender and race attributes.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 33
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs cs.CV · 2026-04-22 · conditional · none · ref 24
R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

fields

years

verdicts

representative citing papers

citing papers explorer