Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi · 2022

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

representative citing papers

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

cs.LG · 2026-02-23 · unverdicted · novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.

EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis

cs.CV · 2025-11-16 · unverdicted · novelty 7.0

EmoVerse is a large open-source dataset enabling interpretable visual emotion analysis via B-A-S triplets, region grounding, and unified CES/DES representations created through an MLLM-driven pipeline.

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

cs.CV · 2026-03-15 · unverdicted · novelty 6.0

Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.

citing papers explorer

Showing 3 of 3 citing papers.

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models cs.LG · 2026-02-23 · unverdicted · none · ref 15
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.
EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis cs.CV · 2025-11-16 · unverdicted · none · ref 19
EmoVerse is a large open-source dataset enabling interpretable visual emotion analysis via B-A-S triplets, region grounding, and unified CES/DES representations created through an MLLM-driven pipeline.
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models cs.CV · 2026-03-15 · unverdicted · none · ref 9
Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

fields

years

verdicts

representative citing papers

citing papers explorer