InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang · 2023 · cs.CV · arXiv 2305.06500

Mixed citation behavior. Most common role is background (50%).

81 Pith papers citing it

Background 50% of classified citations

abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.

citation-role summary

background 15 baseline 5 method 5 dataset 1

citation-polarity summary

background 13 baseline 5 use method 5 support 1 unclear 1 use dataset 1
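The page does not say how the headline "Background 50%" figure is computed. As a rough check, and assuming the share is simply a label's count divided by the total in a summary, the sketch below computes the background share for both summaries above. The dictionary names and the `share` helper are illustrative, not part of the site.

```python
# Counts copied from the two summaries on this page.
role_counts = {"background": 15, "baseline": 5, "method": 5, "dataset": 1}
polarity_counts = {
    "background": 13,
    "baseline": 5,
    "use method": 5,
    "support": 1,
    "unclear": 1,
    "use dataset": 1,
}


def share(counts, label):
    """Fraction of all classified citations that carry the given label."""
    total = sum(counts.values())
    return counts[label] / total


role_share = share(role_counts, "background")
polarity_share = share(polarity_counts, "background")

print(f"role summary:     {role_share:.1%} of {sum(role_counts.values())}")
print(f"polarity summary: {polarity_share:.1%} of {sum(polarity_counts.values())}")
```

Under that assumption, both summaries total 26 classified citations, and only the polarity counts (13 of 26) reproduce the 50% figure. The role counts (15 of 26) give a higher share.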

claims ledger

baseline averaged performance on three dimensions for evaluating temporal understanding. Model Type Model Language Model Spatial Temporal Overall Acc Rank Acc Rank Acc Rank LLM Flan-T5 [1] Flan-T5-XL 27.32 17 28.56 11 27.65 17 Vicuna [4] Vicuna-7B 28.16 16 29.46 8 28.50 16 LLaMA [5] LLaMA-7B 26.56 18 27.27 13 26.75 18 ImageLLM BLIP-2 [6] Flan-T5-XL 49.74 3 36.71 3 46.35 3 InstructBLIP [10] Flan-T5-XL 57.80 2 38.31 1 52.73 2 InstructBLIP Vicuna [10] Vicuna-7B 58.76 1 38.05 2 53.37 1 LLaVA [8] LLaMA-7B 36
background , Markdown) [29]. However, this modality transformation is not only limited by the recognition ability of external tools, but also destroys the inherent 2D physical topological structure and spatial alignment of complex tables, especially those with hierarchical headers [41,50]. Recently, with the rapid development of Multimodal Large Language Models (MLLMs) [1,3,15], the research community has begun to explore unified and end-to-end methods for image-based table reasoning, which aims to prese
baseline It should be noted that we have also tried to design instructions with multiple choice questions, but find that it may beyond the capabilities of current MLLMs to follow complex instructions. We conduct massive experiments to evaluate the zero-shot performance of 30 advanced MLLMs on the 14 subtasks. The evaluated MLLMs include BLIP-2 [25], InstructBLIP [12], MiniGPT-4 [66], PandaGPT [41], Multimodal-GPT [16], VisualGLM-6B [5], ImageBind-LLM [18], VPGTrans [58], LaVIN [35], mPLUG-Owl [52], Octop
baseline Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 and DINOv3). By composing all SigLIP2 layers (which exhibit high entropy, capturing diverse semantic features) with the low-entropy DINOv3 layers 10-23 (which encode strong spatial features), CoME-VL achieves consistent improvements over the Molmo [15] baseline (single-encoder), averaging +4.9% on visual understanding/generation and +5.4% on grounding tasks. Abstract Recent
method variant 4 2496 48 19968 39 5985M 1553G 28.3 / 65.3 65.9 variant 5 2816 64 11264 44 6095M 1589G 21.6 / 61.4 66.2 variant 6 2496 80 9984 39 5985M 1564G 16.9 / 60.1 66.2 Table 11. Comparison of hyperparameters in InternViT-6B. The throughput (img/s) and GFLOPs are measured at 224×224 input resolution, with a batch size of 1 or 128 on a single A100 GPU. Flash Attention [35] and bf16 precision are used during testing. "zs IN" denotes the zero-shot top-1 accuracy on the ImageNet-1K validation set [3
method Details on Simulation-Free Training of Flows Following (Lipman et al., 2023), to see that $u_t(z)$ generates $p_t$, we note that the continuity equation provides a necessary and sufficient condition (Villani, 2008): $\frac{d}{dt} p_t(x) + \nabla \cdot [p_t(x) v_t(x)] = 0 \leftrightarrow v_t$ generates probability density path $p_t$. (26) Therefore it suffices to show that $-\nabla \cdot [u_t(z) p_t(z)] = -\nabla \cdot [\mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} u_t(z|\epsilon) \frac{p_t(z|\epsilon)}{p_t(z)} p_t(z)]$ (27) $= \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} -\nabla \cdot [u_t(z|\epsilon) p_t(z|\epsilon)]$ (28) $= \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} \frac{d}{dt} p_t(z|\epsilon) = \frac{d}{dt} p_t(z)$, (29) where we used the c

representative citing papers

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

cs.CV · 2024-07-22 · accept · novelty 8.0

LongVideoBench is a new benchmark for long-context video-language understanding that uses referring reasoning questions on hour-long videos to challenge multimodal models.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.

Brain-IT-VQA: From Brain Signals to Answers

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Brain-IT-VQA decodes visual question answers from fMRI using a transformer to extract language tokens and introduces the NSD-VQA benchmark with 20 controlled questions per image across 20 categories.

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

cs.MM · 2026-04-16 · unverdicted · novelty 7.0

Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.

TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

cs.AI · 2026-04-04 · conditional · novelty 7.0

TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments

cs.CV · 2025-10-05 · unverdicted · novelty 7.0

APO framework aligns multi-source MLLM reasoning under concept drift by using inter-model divergences as negative constraints via supervised bootstrapping and multi-negative Plackett-Luce optimization, with a 7B model outperforming proprietary sources on chest X-ray tasks and a new CXR-MAX benchmark.

Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching

cs.LG · 2025-09-26 · conditional · novelty 7.0

Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

Toward Generalizable Forgery Detection and Reasoning

cs.CV · 2025-03-27 · unverdicted · novelty 7.0

FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

cs.CV · 2024-10-22 · accept · novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

cs.CV · 2024-06-14 · unverdicted · novelty 7.0

Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

cs.CV · 2023-10-23 · unverdicted · novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

cs.CL · 2023-07-30 · unverdicted · novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

Evaluating Object Hallucination in Large Vision-Language Models

cs.CV · 2023-05-17 · accept · novelty 7.0

Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

cs.MM · 2026-05-11 · unverdicted · novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

cs.AI · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.

citing papers explorer

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

cs.AI · 2025-05-26 · unreviewed · ref 8 · internal anchor
