LongVideoBench is a new benchmark for long-context video-language understanding that uses referring reasoning questions on hour-long videos to challenge multimodal models.
hub Mixed citations
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Mixed citation behavior. Most common role is background (50%).
abstract
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
claims ledger
- baseline averaged performance on three dimensions for evaluating temporal understanding.

  | Model Type | Model | Language Model | Spatial Acc | Spatial Rank | Temporal Acc | Temporal Rank | Overall Acc | Overall Rank |
  |---|---|---|---|---|---|---|---|---|
  | LLM | Flan-T5 [1] | Flan-T5-XL | 27.32 | 17 | 28.56 | 11 | 27.65 | 17 |
  | LLM | Vicuna [4] | Vicuna-7B | 28.16 | 16 | 29.46 | 8 | 28.50 | 16 |
  | LLM | LLaMA [5] | LLaMA-7B | 26.56 | 18 | 27.27 | 13 | 26.75 | 18 |
  | ImageLLM | BLIP-2 [6] | Flan-T5-XL | 49.74 | 3 | 36.71 | 3 | 46.35 | 3 |
  | ImageLLM | InstructBLIP [10] | Flan-T5-XL | 57.80 | 2 | 38.31 | 1 | 52.73 | 2 |
  | ImageLLM | InstructBLIP Vicuna [10] | Vicuna-7B | 58.76 | 1 | 38.05 | 2 | 53.37 | 1 |

  LLaVA [8] LLaMA-7B 36
- background , Markdown) [29]. However, this modality transformation is not only limited by the recognition ability of external tools, but also destroys the inherent 2D physical topological structure and spatial alignment of complex tables, especially those with hierarchical headers [41,50]. Recently, with the rapid development of Multimodal Large Language Models (MLLMs) [1,3,15], the research community has begun to explore unified and end-to-end methods for image-based table reasoning, which aims to prese
- baseline It should be noted that we have also tried to design instructions with multiple choice questions, but find that it may be beyond the capabilities of current MLLMs to follow complex instructions. We conduct massive experiments to evaluate the zero-shot performance of 30 advanced MLLMs on the 14 subtasks. The evaluated MLLMs include BLIP-2 [25], InstructBLIP [12], MiniGPT-4 [66], PandaGPT [41], Multimodal-GPT [16], VisualGLM-6B [5], ImageBind-LLM [18], VPGTrans [58], LaVIN [35], mPLUG-Owl [52], Octop
- baseline Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 and DINOv3). By composing all SigLIP2 layers (which exhibit high entropy, capturing diverse semantic features) with the low-entropy DINOv3 layers 10-23 (which encode strong spatial features), CoME-VL achieves consistent improvements over the Molmo [15] baseline (single-encoder), averaging +4.9% on visual understanding/generation and +5.4% on grounding tasks. Abstract Recent
- method variant 4 2496 48 19968 39 5985M 1553G 28.3 / 65.3 65.9 variant 5 2816 64 11264 44 6095M 1589G 21.6 / 61.4 66.2 variant 6 2496 80 9984 39 5985M 1564G 16.9 / 60.1 66.2 Table 11. Comparison of hyperparameters in InternViT-6B. The throughput (img/s) and GFLOPs are measured at 224×224 input resolution, with a batch size of 1 or 128 on a single A100 GPU. Flash Attention [35] and bf16 precision are used during testing. "zs IN" denotes the zero-shot top-1 accuracy on the ImageNet-1K validation set [3
- method Details on Simulation-Free Training of Flows. Following (Lipman et al., 2023), to see that $u_t(z)$ generates $p_t$, we note that the continuity equation provides a necessary and sufficient condition (Villani, 2008): $\frac{d}{dt} p_t(x) + \nabla \cdot [p_t(x) v_t(x)] = 0 \leftrightarrow v_t$ generates probability density path $p_t$. (26) Therefore it suffices to show that $-\nabla \cdot [u_t(z) p_t(z)] = -\nabla \cdot [\mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} u_t(z|\epsilon) \frac{p_t(z|\epsilon)}{p_t(z)} p_t(z)]$ (27) $= \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} -\nabla \cdot [u_t(z|\epsilon) p_t(z|\epsilon)]$ (28) $= \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} \frac{d}{dt} p_t(z|\epsilon) = \frac{d}{dt} p_t(z)$, (29) where we used the c
representative citing papers
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.
Brain-IT-VQA decodes visual question answers from fMRI using a transformer to extract language tokens and introduces the NSD-VQA benchmark with 20 controlled questions per image across 20 categories.
CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.
A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
APO framework aligns multi-source MLLM reasoning under concept drift by using inter-model divergences as negative constraints via supervised bootstrapping and multi-negative Plackett-Luce optimization, with a 7B model outperforming proprietary sources on chest X-ray tasks and a new CXR-MAX benchmark.
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.