BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
4 Pith papers cite this work.
fields
cs.CV (4)
citing papers explorer
- A Sanity Check on Composed Image Retrieval
  The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.
- A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
  DBAC is a new directional metric for bias amplification in image captions that is less sensitive to sentence encoders and more accurate than LIC, validated on COCO gender and race attributes.
- SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
  SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
- R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs
  R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.