InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Pith reviewed 2026-05-13 02:07 UTC · model grok-4.3
The pith
Instruction tuning on diverse vision-language datasets creates models with strong zero-shot generalization to new tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that instruction tuning applied to pretrained vision-language models, using a broad set of tasks converted to instruction format together with an instruction-aware Query Transformer for tailored feature extraction, produces models with wide competence. Training on 13 held-in datasets yields state-of-the-art zero-shot performance on all 13 held-out datasets and leads to top results when the models are later fine-tuned on individual downstream tasks.
What carries the argument
The instruction-aware Query Transformer, which adapts visual feature extraction to the given text instruction so that the model receives only the most relevant image information for the current task.
If this is right
- Instruction tuning on a moderate number of datasets suffices to produce broad zero-shot ability across many vision-language tasks.
- Further fine-tuning after instruction tuning delivers high accuracy on specific tasks such as visual question answering.
- A single model can handle a wide range of multimodal instructions in a unified manner.
- The resulting models support open use for additional applications and research.
Where Pith is reading between the lines
- The same tuning recipe could be tried on other base vision-language models to improve their instruction-following without starting from scratch.
- Testing the models on novel combinations of capabilities not seen together in training would reveal how far the generalization extends.
- Such systems might support more fluid interactive applications that combine image understanding with natural language instructions.
Load-bearing premise
The held-out datasets contain no overlap or leakage with the held-in training data so that gains reflect genuine generalization from instruction tuning rather than memorization.
What would settle it
Showing substantial data overlap between any held-in and held-out dataset or demonstrating that the performance gains vanish when testing on a fresh vision-language task absent from the original collection of 26 datasets.
read the original abstract
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InstructBLIP by applying instruction tuning to pretrained BLIP-2 models. It collects and converts 26 public vision-language datasets into instruction format, trains on a 13-dataset held-in split, and reports state-of-the-art zero-shot results on the complementary 13 held-out datasets, outperforming BLIP-2 and larger Flamingo models. Additional gains are shown after task-specific fine-tuning (e.g., 90.7% on ScienceQA), and the models are open-sourced.
Significance. If the held-out evaluation is uncontaminated, the work supplies the first large-scale empirical demonstration that instruction tuning yields broad zero-shot generalization in vision-language models, analogous to its success in language-only models. The open release of code and checkpoints is a concrete contribution that enables follow-up research.
major comments (2)
- [Section 4 and Section 3.2] Section 4 (Experiments) and the dataset description in Section 3.2: the central zero-shot SOTA claim on the 13 held-out datasets is load-bearing for the paper's generalization narrative, yet no explicit check, deduplication step, or overlap statistics are reported for images, captions, or questions that may be shared across the held-in/held-out split (many constituent datasets draw from COCO, Flickr30k, and Visual Genome). Without such verification the reported gains over BLIP-2 could partly reflect leakage rather than instruction-tuning benefits.
- [Section 3.3] Section 3.3 (Instruction-aware Query Transformer): the architectural modification is presented as key to conditioning on instructions, but the ablation isolating its contribution versus standard Q-Former + instruction tuning is not shown; the performance delta could be driven primarily by the larger instruction-tuning corpus rather than the new module.
minor comments (2)
- [Table 2] Table 2 and the held-out dataset list: several datasets share visual sources; adding a column or footnote indicating the source image collections would improve transparency.
- [Section 3.2] The instruction templates used for each dataset are described only at a high level; releasing the exact templates (or a supplementary file) would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The two major comments identify important gaps in verification and ablation that we will address through targeted revisions to strengthen the claims on generalization and architectural contributions.
read point-by-point responses
-
Referee: [Section 4 and Section 3.2] Section 4 (Experiments) and the dataset description in Section 3.2: the central zero-shot SOTA claim on the 13 held-out datasets is load-bearing for the paper's generalization narrative, yet no explicit check, deduplication step, or overlap statistics are reported for images, captions, or questions that may be shared across the held-in/held-out split (many constituent datasets draw from COCO, Flickr30k, and Visual Genome). Without such verification the reported gains over BLIP-2 could partly reflect leakage rather than instruction-tuning benefits.
Authors: We acknowledge that explicit overlap verification was not reported. The held-in/held-out split was constructed at the dataset level to ensure the 13 held-out tasks were unseen during instruction tuning, even though some source image collections (e.g., COCO) are shared across multiple VL benchmarks. To address the concern directly, we will add a new paragraph to Section 3.2 that (1) lists all constituent datasets and their original sources, (2) reports the percentage of image overlap between the held-in and held-out collections, and (3) discusses why any residual overlap is unlikely to explain the observed zero-shot gains, given that the held-out evaluation uses entirely different instructions and question formats. If substantial leakage is found, we will also report results after removing overlapping images. revision: yes
-
Referee: [Section 3.3] Section 3.3 (Instruction-aware Query Transformer): the architectural modification is presented as key to conditioning on instructions, but the ablation isolating its contribution versus standard Q-Former + instruction tuning is not shown; the performance delta could be driven primarily by the larger instruction-tuning corpus rather than the new module.
Authors: We agree that an explicit ablation isolating the instruction-aware Q-Former from the effect of the larger instruction-tuning corpus would be valuable. The module was introduced precisely to make the visual queries instruction-dependent, which standard Q-Former does not do. In the revision we will add an ablation in Section 4 that trains a baseline using the original (non-instruction-aware) Q-Former architecture on the identical 13 held-in datasets and instruction format, then compares its zero-shot performance on the held-out sets against the full InstructBLIP model. This will clarify the incremental benefit of the architectural change. revision: yes
Circularity Check
No circularity: purely empirical training and held-out evaluation with no derivation chain
full rationale
The paper describes collecting 26 public datasets, splitting them into 13 held-in and 13 held-out, instruction-tuning BLIP-2 variants on the held-in set, and reporting zero-shot results on the held-out set. No equations, first-principles derivations, uniqueness theorems, or fitted parameters are presented as generating predictions. The central claim is an empirical performance comparison, not a reduction of outputs to inputs by construction. Any risk of image/task overlap is an experimental-validity issue, not a circularity in the reported chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 13 held-in and 13 held-out datasets are disjoint with no data leakage.
Forward citations
Cited by 60 Pith papers
-
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
LongVideoBench is a new benchmark for long-context video-language understanding that uses referring reasoning questions on hour-long videos to challenge multimodal models.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment
A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.
-
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
-
SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters
Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.
-
Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...
-
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
-
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
-
Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments
APO framework aligns multi-source MLLM reasoning under concept drift by using inter-model divergences as negative constraints via supervised bootstrapping and multi-negative Plackett-Luce optimization, with a 7B model...
-
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
-
FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks
A new benchmark dataset and evaluation framework for testing multimodal AI agents on real field work tasks derived from on-site data and worker interviews.
-
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...
-
Toward Generalizable Forgery Detection and Reasoning
FakeReasoning is an MLLM-based framework for unified forgery detection and reasoning on AI-generated images, supported by the new MMFR-Dataset of 120K images and 378K annotations across 10 generators.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
-
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
-
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...
-
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
Mitigating Multimodal Hallucination via Phase-wise Self-reward
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
-
Counting to Four is still a Chore for VLMs
VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.
-
See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.
-
SAM 3D: 3Dfy Anything in Images
SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.
-
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefic...
-
TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs
TARS uses token-adaptive min-max preference optimization and FFT-based spectral regularization to cut hallucination rates in MLLMs from 26.4% to 13.2% with only 4.8k samples, outperforming standard DPO and larger data...
-
UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding
UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
-
SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding
SVL pretraining enables SNNs to reach 85.4% top-1 accuracy on zero-shot 3D classification while outperforming prior SNNs on detection, segmentation, and action recognition with added open-world QA capability.
-
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Presents ChatSearch dataset and ChatSearcher generative model for conversational image retrieval on open-domain images, claiming superior performance on the new dataset and competitive results elsewhere.
-
ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization
ForgeryGPT integrates a forgery localization expert and mask encoder into an LLM for pixel-level forgery detection, localization, and explainable output via three-stage training on custom mask-text and instruction datasets.
-
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
-
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
LMMS-EVAL delivers a standardized multimodal evaluation framework with lite and live variants that target the trade-offs among coverage, cost, and zero contamination.
-
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
-
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
-
TempCompass: Do Video LLMs Really Understand Videos?
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
-
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.
-
SALMONN: Towards Generic Hearing Abilities for Large Language Models
SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
-
Aligning Large Multimodal Models with Factually Augmented RLHF
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
-
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
-
DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models
Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.
-
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-f...
-
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.
-
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
-
Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis
CRAVE is a new framework that clusters retrieved text and image evidence into narratives and uses an LLM judge to produce explained fact-checking verdicts.
-
Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model
Q-Agent uses CoT decomposition on a fine-tuned MLLM for multi-degradation perception plus IQA-driven greedy selection of restoration algorithms to claim better performance than All-in-One IR models.
-
Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
Reference graph
Works this paper leans on
-
[1]
https://openai.com/blog/chatgpt, 2023
Chatgpt. https://openai.com/blog/chatgpt, 2023. 9
work page 2023
-
[2]
https://github.com/lm-sys/FastChat, 2023
Vicuna. https://github.com/lm-sys/FastChat, 2023. 3, 6, 9
work page 2023
-
[3]
nocaps: novel object captioning at scale
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In ICCV, pages 8948–8957, 2019. 3, 16
work page 2019
-
[4]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´nk...
work page 2022
-
[5]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[6]
Unifying vision-and-language tasks via text generation
Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779, 2021. 1
-
[7]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y . Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jef...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, 2017. 3, 16
work page 2017
-
[9]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...
work page 2023
-
[10]
EV A: Exploring the Limits of Masked Visual Repre- sentation Learning at Scale 2022
Yuxin Fang, Wen Wang, Binhui Xie, Quan-Sen Sun, Ledell Yu Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. ArXiv, abs/2211.07636, 2022. 6
-
[11]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, July 2017. 3, 16
work page 2017
-
[12]
Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P
Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018. 3, 16
work page 2018
-
[13]
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. ArXiv, abs/2212.09689, 2022. 9
-
[14]
Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022. 9
work page 2022
-
[15]
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, and Jiebo Luo. Promptcap: Prompt- guided task-aware image captioning, 2023. 9
work page 2023
-
[16]
Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 3, 16
work page 2019
-
[17]
Deep visual-semantic alignments for generating image descriptions
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. 16
work page 2015
-
[18]
The hateful memes challenge: Detecting hate speech in multimodal memes
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes. In NeurIPS, 2020. 3, 16 10
work page 2020
-
[19]
Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. Lavis: A library for language-vision intelligence, 2022. 6
work page 2022
-
[20]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 1, 3, 4, 6, 9, 16
work page 2023
-
[21]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 5, 16
work page 2022
-
[22]
Align before fuse: Vision and language representation learning with momentum distillation
Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021. 5
work page 2021
-
[23]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 3, 16
work page 2014
-
[24]
Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023. 3
work page 2023
-
[25]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023. 3, 7, 9, 16
work page 2023
-
[26]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. 6
work page 2019
-
[27]
12-in-1: Multi-task vision and language representation learning
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, 2020. 1
work page 2020
-
[28]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022. 3, 16
work page 2022
-
[29]
Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning
Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In NeurIPS Track on Datasets and Benchmarks, 2021. 3, 16
work page 2021
-
[30]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019. 3, 16
work page 2019
-
[31]
Ocr-vqa: Visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 3, 16
work page 2019
-
[32]
Large-scale pretraining for visual dialog: A simple state-of-the-art baseline
Vishvak Murahari, Dhruv Batra, Devi Parikh, and Abhishek Das. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, ECCV, 2020. 6, 7
work page 2020
-
[33]
OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023. 7, 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020. 3, 6
work page 2020
-
[35]
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, An- toine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V . Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica...
-
[36]
A- okvqa: A benchmark for visual question answering using world knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A- okvqa: A benchmark for visual question answering using world knowledge. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,ECCV, 2022. 3, 16
work page 2022
-
[37]
Prompting large language models with answer heuristics for knowledge-based visual question answering
Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. Prompting large language models with answer heuristics for knowledge-based visual question answering. Computer Vision and Pattern Recognition (CVPR), 2023. 9
work page 2023
-
[38]
Textcaps: a dataset for image captioningwith reading comprehension
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioningwith reading comprehension. 2020. 3, 16
work page 2020
-
[39]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarjan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, pages 8317–8326, 2019. 3, 16
work page 2019
- [40]
-
[41]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3, 6, 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Lawrence Zitnick, and Devi Parikh
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 4566–4575, 2015. 6
work page 2015
-
[43]
Git: A generative image-to-text transformer for vision and language, 2022
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language, 2022. 9
work page 2022
-
[44]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. ArXiv, abs/2212.10560, 2022. 9
work page internal anchor Pith review arXiv 2022
-
[45]
Super- NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parma...
work page 2022
-
[46]
Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M
Jason Wei, Maarten Bosma, Vincent Y . Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V . Le. Finetuned language models are zero-shot learners. InICLR, 2022. 1, 5, 8, 9
work page 2022
-
[47]
Video question answering via gradually refined attention over appearance and motion
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, page 1645–1653, 2017. 3, 16
work page 2017
-
[48]
Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren
Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. ArXiv, abs/2212.10773, 2022. 9
-
[49]
Just ask: Learning to answer questions from millions of narrated videos
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Just ask: Learning to answer questions from millions of narrated videos. In ICCV, pages 1686–1697, 2021. 3, 6, 16
work page 2021
-
[50]
mplug-owl: Modularization empowers large language models with multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yi Zhou, Junyan Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qiang Qi, Ji Chao Zhang, and Feiyan Huang. mplug-owl: Modularization empowers large language models with multimodality. 2023. 9
work page 2023
-
[51]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 2014. 3, 16
work page 2014
-
[52]
The Girl with the Pearl Earring
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023. 7, 9 12 A Broader Impact InstructBLIP uses off-the-shelf frozen LLMs. Therefore it inherits some of the shortcomings from the original LLMs, such as hallucinating ungrounded text or generating o...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.