InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Mixed citation behavior. Most common role is background (50%).
abstract
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
claims ledger
- baseline averaged performance on three dimensions for evaluating temporal understanding.
  Model Type | Model | Language Model | Spatial Acc | Spatial Rank | Temporal Acc | Temporal Rank | Overall Acc | Overall Rank
  LLM | Flan-T5 [1] | Flan-T5-XL | 27.32 | 17 | 28.56 | 11 | 27.65 | 17
  LLM | Vicuna [4] | Vicuna-7B | 28.16 | 16 | 29.46 | 8 | 28.50 | 16
  LLM | LLaMA [5] | LLaMA-7B | 26.56 | 18 | 27.27 | 13 | 26.75 | 18
  ImageLLM | BLIP-2 [6] | Flan-T5-XL | 49.74 | 3 | 36.71 | 3 | 46.35 | 3
  ImageLLM | InstructBLIP [10] | Flan-T5-XL | 57.80 | 2 | 38.31 | 1 | 52.73 | 2
  ImageLLM | InstructBLIP Vicuna [10] | Vicuna-7B | 58.76 | 1 | 38.05 | 2 | 53.37 | 1
  ImageLLM | LLaVA [8] | LLaMA-7B | 36
- background , Markdown) [29]. However, this modality transformation is not only limited by the recognition ability of external tools, but also destroys the inherent 2D physical topological structure and spatial alignment of complex tables, especially those with hierarchical headers [41,50]. Recently, with the rapid development of Multimodal Large Language Models (MLLMs) [1,3,15], the research community has begun to explore unified and end-to-end methods for image-based table reasoning, which aims to prese
- baseline It should be noted that we have also tried to design instructions with multiple choice questions, but find that it may be beyond the capabilities of current MLLMs to follow complex instructions. We conduct massive experiments to evaluate the zero-shot performance of 30 advanced MLLMs on the 14 subtasks. The evaluated MLLMs include BLIP-2 [25], InstructBLIP [12], MiniGPT-4 [66], PandaGPT [41], Multimodal-GPT [16], VisualGLM-6B [5], ImageBind-LLM [18], VPGTrans [58], LaVIN [35], mPLUG-Owl [52], Octop
- baseline Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 and DINOv3). By composing all SigLIP2 layers (which exhibit high entropy, capturing diverse semantic features) with the low-entropy DINOv3 layers 10-23 (which encode strong spatial features), CoME-VL achieves consistent improvements over the Molmo [15] baseline (single-encoder), averaging +4.9% on visual understanding/generation and +5.4% on grounding tasks. Abstract Recent
- method variant 4 2496 48 19968 39 5985M 1553G 28.3 / 65.3 65.9 variant 5 2816 64 11264 44 6095M 1589G 21.6 / 61.4 66.2 variant 6 2496 80 9984 39 5985M 1564G 16.9 / 60.1 66.2 Table 11. Comparison of hyperparameters in InternViT-6B. The throughput (img/s) and GFLOPs are measured at 224×224 input resolution, with a batch size of 1 or 128 on a single A100 GPU. Flash Attention [35] and bf16 precision are used during testing. "zs IN" denotes the zero-shot top-1 accuracy on the ImageNet-1K validation set [3
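- method Details on Simulation-Free Training of Flows. Following (Lipman et al., 2023), to see that $u_t(z)$ generates $p_t$, we note that the continuity equation provides a necessary and sufficient condition (Villani, 2008): $\frac{d}{dt} p_t(x) + \nabla \cdot [p_t(x) v_t(x)] = 0 \leftrightarrow v_t$ generates probability density path $p_t$. (26) Therefore it suffices to show that $-\nabla \cdot [u_t(z) p_t(z)] = -\nabla \cdot \left[\mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\, u_t(z|\epsilon) \frac{p_t(z|\epsilon)}{p_t(z)}\, p_t(z)\right]$ (27) $= \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} -\nabla \cdot [u_t(z|\epsilon)\, p_t(z|\epsilon)]$ (28) $= \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} \frac{d}{dt} p_t(z|\epsilon) = \frac{d}{dt} p_t(z)$, (29) where we used the c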
citing papers explorer
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
LongVideoBench is a new benchmark for long-context video-language understanding that uses referring reasoning questions on hour-long videos to challenge multimodal models.
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
- Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment
A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.
- Allegory of the Cave: Measurement-Grounded Vision-Language Learning
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
- SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters
Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.
- Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.
- TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
- Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
- Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments
APO framework aligns multi-source MLLM reasoning under concept drift by using inter-model divergences as negative constraints via supervised bootstrapping and multi-negative Plackett-Luce optimization, with a 7B model outperforming proprietary sources on chest X-ray tasks and a new CXR-MAX benchmark.
- Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
- FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks
A new benchmark dataset and evaluation framework for testing multimodal AI agents on real field work tasks derived from on-site data and worker interviews.
- FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
- Toward Generalizable Forgery Detection and Reasoning
FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
- Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
- HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
- Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
- Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
- Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy at low token budgets.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- Mitigating Multimodal Hallucination via Phase-wise Self-reward
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
- Counting to Four is still a Chore for VLMs
VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.
- See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
- Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.
- SAM 3D: 3Dfy Anything in Images
SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.
- Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefics2, InstructBLIP and Qwen2.5-VL with >96% throughput.
- TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs
TARS uses token-adaptive min-max preference optimization and FFT-based spectral regularization to cut hallucination rates in MLLMs from 26.4% to 13.2% with only 4.8k samples, outperforming standard DPO and larger data-augmented baselines.
- UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding
UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
- SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding
SVL pretraining enables SNNs to reach 85.4% top-1 accuracy on zero-shot 3D classification while outperforming prior SNNs on detection, segmentation, and action recognition with added open-world QA capability.
- ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Presents ChatSearch dataset and ChatSearcher generative model for conversational image retrieval on open-domain images, claiming superior performance on the new dataset and competitive results elsewhere.
- ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization
ForgeryGPT integrates a forgery localization expert and mask encoder into an LLM for pixel-level forgery detection, localization, and explainable output via three-stage training on custom mask-text and instruction datasets.
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
LMMS-EVAL delivers a standardized multimodal evaluation framework with lite and live variants that target the trade-offs among coverage, cost, and zero contamination.
- OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
- TempCompass: Do Video LLMs Really Understand Videos?
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
- NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
- AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.
- SALMONN: Towards Generic Hearing Abilities for Large Language Models
SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
- Aligning Large Multimodal Models with Factually Augmented RLHF
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
- DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
- MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.